# Cluster Coordination: Domain Model Index

This directory contains a complete Domain-Driven Design model for the Cluster Coordination bounded context in Aether. Use this index to navigate the documentation.

---

## Quick Start

**Start here if you're new to this analysis:**

1. Read [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) **Summary** section (1-2 min)
2. Skim the **Invariants** section to understand the constraints (2 min)
3. Read [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) **Overview: Code vs Domain Model** (5 min)
4. Choose your next step based on your role (see below)

---

## Documents Overview

### [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Comprehensive DDD Model
**What:** Complete tactical DDD model with aggregates, commands, events, policies, read models

**Contains:**
- Cluster Coordination context summary
- 5 core invariants (single leader, shard coverage, etc.)
- 3 root aggregates: Cluster, LeadershipLease, ShardAssignment
- 6 commands: JoinCluster, ElectLeader, MarkNodeFailed, etc.
- 11 events: NodeJoined, LeaderElected, ShardMigrated, etc.
- 10 policies: Single Leader Policy, Lease Renewal Policy, etc.
- 5 read models: GetClusterTopology, GetLeader, GetShardAssignments, etc.
- 4 value objects: NodeInfo, ShardMap, LeadershipLease, Term
- Code analysis comparing intended vs actual implementation
- 7 refactoring issues with impact assessment
- Testing strategy (unit, integration, chaos tests)
- Boundary conditions and limitations
- Alignment with product vision

**Best for:** Understanding the complete domain model, identifying what needs to change

**Time:** 30-40 minutes for thorough read

---

### [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation Roadmap
**What:** Prioritized refactoring plan with 4-phase implementation strategy

**Contains:**
- Current state vs intended state (what's working, what's broken)
- Gap analysis (6 major gaps identified)
- Priority matrix (High/Medium/Low priority issues)
- 4-phase refactoring plan:
  - Phase 1: Extract cluster commands (Week 1)
  - Phase 2: Publish domain events (Week 2)
  - Phase 3: Implement real rebalancing (Week 3)
  - Phase 4: Unify shard invariants (Week 4)
- Code examples for each phase
- Testing checklist
- Success metrics
- Integration with other contexts

**Best for:** Planning implementation, deciding what to do first, estimating effort

**Time:** 20-30 minutes for full review

---

### [ARCHITECTURE.md](./ARCHITECTURE.md) - Visual Reference & Decision Trees
**What:** Diagrams, flowcharts, and decision trees for understanding cluster behavior

**Contains:**
- High-level architecture diagram
- Aggregate boundaries diagram
- 3 command flow diagrams with decision points
- 3 decision trees (Is node healthy? Should rebalance? Can assign shard?)
- State transition diagrams (cluster, node, leadership)
- Concurrency model and thread safety explanation
- Event sequences with timelines
- Configuration parameters and tuning guide
- Failure scenarios & recovery procedures
- Monitoring & observability metrics
- Alerts and SLOs

**Best for:** Understanding how the system works, debugging issues, planning changes

**Time:** 20-30 minutes; skim decision trees as needed

---

### [PATTERNS.md](./PATTERNS.md) - Code Patterns & Examples
**What:** Side-by-side code comparisons showing how to evolve the implementation

**Contains:**
- 6 refactoring patterns with current vs intended code:
  1. Commands vs Message Handlers
  2. Value Objects vs Primitives
  3. Event Publishing (no events → explicit events)
  4. Invariant Validation (scattered → centralized)
  5. Rebalancing Strategy (stubbed → real implementation)
  6. Testing Aggregates (hard to test → testable with mocks)
- Full code examples for each pattern
- Benefits of each approach
- Mock implementations for testing

**Best for:** Developers writing the refactoring code, understanding specific patterns

**Time:** 30-40 minutes to read all examples

---

## Navigation by Role

### Product Manager / Tech Lead
**Goal:** Understand what needs to change and why

1. Read REFACTORING_SUMMARY.md **Overview** (5 min)
2. Read REFACTORING_SUMMARY.md **Refactoring Priority Matrix** (3 min)
3. Read REFACTORING_SUMMARY.md **Refactoring Plan** - Phase 1 only (5 min)
4. Decide: Which phases to commit to? Which timeline?

**Time:** 15 minutes

---

### Developer (Implementing Refactoring)
**Goal:** Understand how to write the code

1. Skim DOMAIN_MODEL.md **Summary** (2 min)
2. Read DOMAIN_MODEL.md **Invariants** (5 min) - what must never break?
3. Read DOMAIN_MODEL.md **Aggregates** (5 min) - who owns what?
4. Read DOMAIN_MODEL.md **Commands** (5 min) - what actions are there?
5. Read PATTERNS.md sections relevant to your phase (10-20 min)
6. Refer to ARCHITECTURE.md **Decision Trees** as you code (on-demand)

**Time:** 30-50 minutes of reading; then 2-8 hours of coding per phase

---

### Architect / Design Reviewer
**Goal:** Validate the domain model and refactoring plan

1. Read DOMAIN_MODEL.md completely (40 min)
2. Review REFACTORING_SUMMARY.md **Current State** (10 min)
3. Scan ARCHITECTURE.md diagrams (10 min)
4. Review PATTERNS.md for code quality (15 min)
5. Provide feedback on:
   - Are the invariants correct and complete?
   - Are the aggregate boundaries clear?
   - Is the refactoring plan realistic?
   - Are we missing any patterns?

**Time:** 60-90 minutes

---

### QA / Tester
**Goal:** Understand what to test

1. Read DOMAIN_MODEL.md **Testing Strategy** (5 min)
2. Read REFACTORING_SUMMARY.md **Testing Checklist** (5 min)
3. Read ARCHITECTURE.md **Failure Scenarios** (10 min)
4. Read PATTERNS.md **Pattern 6: Testing Aggregates** (15 min)
5. Create test plan covering:
   - Unit tests for commands
   - Integration tests for full scenarios
   - Chaos tests for resilience

**Time:** 40 minutes of planning; then test writing

---

### Operator / DevOps
**Goal:** Understand how to monitor and operate

1. Read ARCHITECTURE.md **Monitoring & Observability** (10 min)
2. Read ARCHITECTURE.md **Configuration & Tuning** (10 min)
3. Read ARCHITECTURE.md **Failure Scenarios** (15 min)
4. Plan:
   - Which metrics to export?
   - Which alerts to set?
   - How to detect issues?
   - How to recover?

**Time:** 35 minutes

---

## Key Concepts

### Invariants
Business rules that must NEVER be violated. The core of the domain model.

- **I1:** At most one leader per term
- **I2:** All active shards have owners
- **I3:** Shards only assigned to healthy nodes
- **I4:** Shard assignments stable during lease
- **I5:** Leader is an active node

### Aggregates
Clusters of entities enforcing invariants. Root aggregates own state changes.

- **Cluster** (root) - owns topology, shard assignments
- **LeadershipLease** (root) - owns leadership
- **ShardAssignment** (root) - owns shard-to-node mappings

### Commands
Explicit intent to change state. Named with domain language.

- JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards

### Events
Facts that happened. Published after successful commands.

- NodeJoined, NodeFailed, LeaderElected, ShardAssigned, ShardMigrated

### Policies
Automated reactions. Connect events to commands.

- "When NodeJoined then RebalanceShards"
- "When LeadershipLost then ElectLeader"

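
One way to read the two policies above is as a lookup from event to follow-up command. The sketch below assumes that reading; `Event`, `Command`, `policies`, and `React` are hypothetical names for illustration, and real policies would likely carry payloads and conditions that a plain table cannot express.

```go
package main

import "fmt"

// Event and Command are hypothetical string-based identifiers.
type Event string
type Command string

const (
	NodeJoined     Event = "NodeJoined"
	LeadershipLost Event = "LeadershipLost"
)

const (
	RebalanceShards Command = "RebalanceShards"
	ElectLeader     Command = "ElectLeader"
)

// policies encodes "When <event> then <command>" for the two
// examples listed in this index.
var policies = map[Event]Command{
	NodeJoined:     RebalanceShards,
	LeadershipLost: ElectLeader,
}

// React returns the command a policy issues for ev.
// The boolean is false when no policy is registered for the event.
func React(ev Event) (Command, bool) {
	cmd, ok := policies[ev]
	return cmd, ok
}

func main() {
	for _, ev := range []Event{NodeJoined, LeadershipLost, "ShardMigrated"} {
		if cmd, ok := React(ev); ok {
			fmt.Printf("When %s then %s\n", ev, cmd)
		} else {
			fmt.Printf("No policy registered for %s\n", ev)
		}
	}
}
```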
---

## Glossary

| Term | Definition |
|------|-----------|
| Bounded Context | A boundary within which a domain model is consistent (Cluster Coordination) |
| Aggregate | A cluster of entities enforcing business invariants; transactional boundary |
| Aggregate Root | The only entity in an aggregate that external code references |
| Invariant | A business rule that must always be true |
| Command | A request to change state (intent-driven) |
| Event | A fact that happened in the past (immutable) |
| Policy | An automated reaction to events; connects contexts |
| Read Model | A projection of state optimized for queries (no invariants) |
| Value Object | Immutable object defined by attributes, not identity |
| CQRS | Command Query Responsibility Segregation (commands change state; queries read state) |
| Event Sourcing | Storing events as source of truth; state is derived by replay |

---

## Related Context Maps

**Upstream (External Dependencies):**
- **NATS** - Provides pub/sub, KV store, JetStream
- **Local Runtime** - Executes actors on this node
- **Event Store** - Persists cluster events (optional)

**Downstream (Consumers):**
- **Actor Runtime Context** - Migrates actors when shards move (reacts to ShardMigrated)
- **Monitoring Context** - Tracks health and events (subscribes to topology events)
- **Audit Context** - Records all topology changes (subscribes to all events)

---

## Quick Reference: Decision Trees

### Is a node healthy?
```
Node found? → Check status (Active|Draining|Failed)
  → Active/Draining? → YES
  → Failed? → NO
```
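
The health check above translates almost directly into a predicate. This is a minimal sketch with hypothetical names (`NodeStatus`, `IsHealthy`); the tree does not say what happens when the node is not found, so treating an unknown node as unhealthy is an assumption.

```go
package main

import "fmt"

// NodeStatus mirrors the three statuses named in the decision tree.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// IsHealthy follows the tree: a known node is healthy when it is
// Active or Draining, and unhealthy when it is Failed.
func IsHealthy(nodes map[string]NodeStatus, nodeID string) bool {
	status, found := nodes[nodeID]
	if !found {
		return false // assumption: the tree does not cover unknown nodes
	}
	return status == Active || status == Draining
}

func main() {
	nodes := map[string]NodeStatus{
		"node-a": Active,
		"node-b": Draining,
		"node-c": Failed,
	}
	for _, id := range []string{"node-a", "node-b", "node-c", "node-d"} {
		fmt.Printf("%s healthy: %v\n", id, IsHealthy(nodes, id))
	}
}
```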
### Should we rebalance?
```
Leader? → YES
Active nodes? → YES
Strategy.Rebalance() → returns new ShardMap
Validate invariants? → YES
Publish ShardMigrated events
```
### Can we assign shard to node?
```
Node exists? → YES
Status active? → YES
Replication < max? → YES
Add node to shard's replica list
```

See [ARCHITECTURE.md](./ARCHITECTURE.md) for full decision trees.

---

## Testing Resources

**Test Coverage Map:**
- Unit tests: Commands, invariants, value objects
- Integration tests: Full scenarios (node join, node fail, rebalance)
- Chaos tests: Partitions, cascading failures, high churn

See [PATTERNS.md](./PATTERNS.md) **Pattern 6** for testing patterns and mocks.

---

## Common Questions

**Q: Why not use Raft for leader election?**
A: Lease-based election is simpler and sufficient for our use case. Raft would be safer but more complex. See DOMAIN_MODEL.md **Design Decisions**.

**Q: What if a leader fails mid-rebalance?**
A: The new leader will detect the incomplete rebalance and may redo it. This is acceptable because rebalancing is idempotent. See ARCHITECTURE.md **Failure Scenarios**.

**Q: How many shards should we use?**
A: The default of 1024 provides good granularity. Tune it based on your cluster size. See ARCHITECTURE.md **Configuration & Tuning**.

**Q: Can actors be lost during rebalancing?**
A: No, provided the application correctly implements actor migration. See DOMAIN_MODEL.md **Gaps**.

**Q: Is eventual consistency acceptable?**
A: Yes, for topology (replicas lag the leader by ~100ms). Leadership is strongly consistent (atomic operations). See DOMAIN_MODEL.md **Policies**.

---

## Implementation Checklist

- [ ] Read DOMAIN_MODEL.md Summary + Invariants
- [ ] Read REFACTORING_SUMMARY.md Overview
- [ ] Review PATTERNS.md for Phase 1
- [ ] Implement Phase 1 commands (JoinCluster, MarkNodeFailed)
- [ ] Add tests for Phase 1
- [ ] Code review
- [ ] Merge Phase 1
- [ ] Repeat for Phases 2-4

---

## Document Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-01-12 | Initial domain model created |

---

## Contact & Questions

For questions about this domain model:
- **Domain modeling:** Refer to DOMAIN_MODEL.md Invariants & Aggregates sections
- **Implementation:** Refer to PATTERNS.md for code examples
- **Architecture:** Refer to ARCHITECTURE.md for system design
- **Refactoring plan:** Refer to REFACTORING_SUMMARY.md for priorities

---

## Additional Resources

- [Vision](../vision.md) - Product vision for Aether
- [Project Structure](../README.md) - How this repository is organized
- [Event Sourcing Guide](../event.go) - Event and EventStore interface
- [NATS Documentation](https://docs.nats.io) - NATS pub/sub and JetStream