Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
352
.product-strategy/cluster/INDEX.md
Normal file
@@ -0,0 +1,352 @@
# Cluster Coordination: Domain Model Index

This directory contains a complete Domain-Driven Design model for the Cluster Coordination bounded context in Aether. Use this index to navigate the documentation.

---

## Quick Start

**Start here if you're new to this analysis:**

1. Read [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) **Summary** section (1-2 min)
2. Skim the **Invariants** section to understand the constraints (2 min)
3. Read [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) **Overview: Code vs Domain Model** (5 min)
4. Choose your next step based on your role (see below)

---
## Documents Overview

### [DOMAIN_MODEL.md](./DOMAIN_MODEL.md) - Comprehensive DDD Model
**What:** Complete tactical DDD model with aggregates, commands, events, policies, read models

**Contains:**
- Cluster Coordination context summary
- 5 core invariants (single leader, shard coverage, etc.)
- 3 root aggregates: Cluster, LeadershipLease, ShardAssignment
- 6 commands: JoinCluster, ElectLeader, MarkNodeFailed, etc.
- 11 events: NodeJoined, LeaderElected, ShardMigrated, etc.
- 10 policies: Single Leader Policy, Lease Renewal Policy, etc.
- 5 read models: GetClusterTopology, GetLeader, GetShardAssignments, etc.
- 4 value objects: NodeInfo, ShardMap, LeadershipLease, Term
- Code analysis comparing intended vs actual implementation
- 7 refactoring issues with impact assessment
- Testing strategy (unit, integration, chaos tests)
- Boundary conditions and limitations
- Alignment with product vision

**Best for:** Understanding the complete domain model, identifying what needs to change

**Time:** 30-40 minutes for thorough read

---
### [REFACTORING_SUMMARY.md](./REFACTORING_SUMMARY.md) - Implementation Roadmap
**What:** Prioritized refactoring plan with 4-phase implementation strategy

**Contains:**
- Current state vs intended state (what's working, what's broken)
- Gap analysis (6 major gaps identified)
- Priority matrix (High/Medium/Low priority issues)
- 4-phase refactoring plan:
  - Phase 1: Extract cluster commands (Week 1)
  - Phase 2: Publish domain events (Week 2)
  - Phase 3: Implement real rebalancing (Week 3)
  - Phase 4: Unify shard invariants (Week 4)
- Code examples for each phase
- Testing checklist
- Success metrics
- Integration with other contexts

**Best for:** Planning implementation, deciding what to do first, estimating effort

**Time:** 20-30 minutes for full review

---
### [ARCHITECTURE.md](./ARCHITECTURE.md) - Visual Reference & Decision Trees
**What:** Diagrams, flowcharts, and decision trees for understanding cluster behavior

**Contains:**
- High-level architecture diagram
- Aggregate boundaries diagram
- 3 command flow diagrams with decision points
- 3 decision trees (Is node healthy? Should rebalance? Can assign shard?)
- State transition diagrams (cluster, node, leadership)
- Concurrency model and thread safety explanation
- Event sequences with timelines
- Configuration parameters and tuning guide
- Failure scenarios & recovery procedures
- Monitoring & observability metrics
- Alerts and SLOs

**Best for:** Understanding how the system works, debugging issues, planning changes

**Time:** 20-30 minutes; skim decision trees as needed

---
### [PATTERNS.md](./PATTERNS.md) - Code Patterns & Examples
**What:** Side-by-side code comparisons showing how to evolve the implementation

**Contains:**
- 6 refactoring patterns with current vs intended code:
  1. Commands vs Message Handlers
  2. Value Objects vs Primitives
  3. Event Publishing (no events → explicit events)
  4. Invariant Validation (scattered → centralized)
  5. Rebalancing Strategy (stubbed → real implementation)
  6. Testing Aggregates (hard to test → testable with mocks)
- Full code examples for each pattern
- Benefits of each approach
- Mock implementations for testing

**Best for:** Developers writing the refactoring code, understanding specific patterns

**Time:** 30-40 minutes to read all examples

---
## Navigation by Role

### Product Manager / Tech Lead
**Goal:** Understand what needs to change and why

1. Read REFACTORING_SUMMARY.md **Overview** (5 min)
2. Read REFACTORING_SUMMARY.md **Refactoring Priority Matrix** (3 min)
3. Read REFACTORING_SUMMARY.md **Refactoring Plan** - Phase 1 only (5 min)
4. Decide: Which phases to commit to? Which timeline?

**Time:** 15 minutes

---
### Developer (Implementing Refactoring)
**Goal:** Understand how to write the code

1. Skim DOMAIN_MODEL.md **Summary** (2 min)
2. Read DOMAIN_MODEL.md **Invariants** (5 min) - what must never break?
3. Read DOMAIN_MODEL.md **Aggregates** (5 min) - who owns what?
4. Read DOMAIN_MODEL.md **Commands** (5 min) - what actions are there?
5. Read PATTERNS.md sections relevant to your phase (10-20 min)
6. Refer to ARCHITECTURE.md **Decision Trees** as you code (on-demand)

**Time:** 30-50 minutes of reading; then 2-8 hours of coding per phase

---
### Architect / Design Reviewer
**Goal:** Validate the domain model and refactoring plan

1. Read DOMAIN_MODEL.md completely (40 min)
2. Review REFACTORING_SUMMARY.md **Current State** (10 min)
3. Scan ARCHITECTURE.md diagrams (10 min)
4. Review PATTERNS.md for code quality (15 min)
5. Provide feedback on:
   - Are the invariants correct and complete?
   - Are the aggregate boundaries clear?
   - Is the refactoring plan realistic?
   - Are we missing any patterns?

**Time:** 60-90 minutes

---
### QA / Tester
**Goal:** Understand what to test

1. Read DOMAIN_MODEL.md **Testing Strategy** (5 min)
2. Read REFACTORING_SUMMARY.md **Testing Checklist** (5 min)
3. Read ARCHITECTURE.md **Failure Scenarios** (10 min)
4. Read PATTERNS.md **Pattern 6: Testing Aggregates** (15 min)
5. Create test plan covering:
   - Unit tests for commands
   - Integration tests for full scenarios
   - Chaos tests for resilience

**Time:** 40 minutes of planning; then test writing

---
### Operator / DevOps
**Goal:** Understand how to monitor and operate

1. Read ARCHITECTURE.md **Monitoring & Observability** (10 min)
2. Read ARCHITECTURE.md **Configuration & Tuning** (10 min)
3. Read ARCHITECTURE.md **Failure Scenarios** (15 min)
4. Plan:
   - Which metrics to export?
   - Which alerts to set?
   - How to detect issues?
   - How to recover?

**Time:** 35 minutes

---
## Key Concepts

### Invariants
Business rules that must NEVER be violated. The core of the domain model.

- **I1:** At most one leader per term
- **I2:** All active shards have owners
- **I3:** Shards only assigned to healthy nodes
- **I4:** Shard assignments stable during lease
- **I5:** Leader is an active node
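
The invariants above can be read as checks over a snapshot of cluster state. The following is a minimal sketch, not Aether's implementation: `ClusterState`, `NodeStatus`, and `CheckInvariants` are hypothetical names introduced for illustration, and "healthy" is assumed to mean "known and not Failed". I4 is omitted because it constrains changes over time and cannot be checked from a single snapshot.

```go
package main

import "fmt"

// NodeStatus is a hypothetical status enum for this sketch.
type NodeStatus int

const (
	Active NodeStatus = iota
	Draining
	Failed
)

// ClusterState is an assumed snapshot of the state the invariants describe.
type ClusterState struct {
	Nodes       map[string]NodeStatus // node ID -> status
	Shards      map[int][]string      // shard ID -> owning node IDs
	TermLeaders map[uint64][]string   // term -> nodes claiming leadership
	Leader      string                // current leader ID ("" if none)
}

// CheckInvariants returns one error per violation of I1, I2, I3, or I5.
func CheckInvariants(s ClusterState) []error {
	var errs []error
	for term, leaders := range s.TermLeaders {
		if len(leaders) > 1 {
			errs = append(errs, fmt.Errorf("I1: term %d has %d leaders", term, len(leaders)))
		}
	}
	for shard, owners := range s.Shards {
		if len(owners) == 0 {
			errs = append(errs, fmt.Errorf("I2: shard %d has no owner", shard))
		}
		for _, id := range owners {
			if st, ok := s.Nodes[id]; !ok || st == Failed {
				errs = append(errs, fmt.Errorf("I3: shard %d is assigned to unhealthy node %q", shard, id))
			}
		}
	}
	if s.Leader != "" {
		if st, ok := s.Nodes[s.Leader]; !ok || st != Active {
			errs = append(errs, fmt.Errorf("I5: leader %q is not an active node", s.Leader))
		}
	}
	return errs
}
```

A command handler could call a check like this on the proposed state before committing it and reject the command if any error is returned.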

### Aggregates
Clusters of entities enforcing invariants. Root aggregates own state changes.

- **Cluster** (root) - owns topology, shard assignments
- **LeadershipLease** (root) - owns leadership
- **ShardAssignment** (root) - owns shard-to-node mappings

### Commands
Explicit intent to change state. Named with domain language.

- JoinCluster, MarkNodeFailed, AssignShards, RebalanceShards

### Events
Facts that happened. Published after successful commands.

- NodeJoined, NodeFailed, LeaderElected, ShardAssigned, ShardMigrated

### Policies
Automated reactions. Connect events to commands.

- "When NodeJoined then RebalanceShards"
- "When LeadershipLost then ElectLeader"
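
One way to picture the "when event then command" rules above is a lookup table from event to commands. This is a sketch under assumptions: `Event`, `Command`, `policies`, and `React` are hypothetical names, and the real model in DOMAIN_MODEL.md lists 10 policies that may carry conditions this table does not express. A plain map keeps each policy a single line of data, so adding one does not touch the dispatch logic.

```go
package main

// Event and Command are hypothetical marker types for this sketch.
type Event string
type Command string

const (
	NodeJoined     Event = "NodeJoined"
	LeadershipLost Event = "LeadershipLost"

	RebalanceShards Command = "RebalanceShards"
	ElectLeader     Command = "ElectLeader"
)

// policies encodes the two "when X then Y" rules listed above.
var policies = map[Event][]Command{
	NodeJoined:     {RebalanceShards},
	LeadershipLost: {ElectLeader},
}

// React returns the commands a policy would issue in response to an event.
// Events with no registered policy yield no commands.
func React(e Event) []Command {
	return policies[e]
}
```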

---

## Glossary

| Term | Definition |
|------|-----------|
| Bounded Context | A boundary within which a domain model is consistent (Cluster Coordination) |
| Aggregate | A cluster of entities enforcing business invariants; transactional boundary |
| Aggregate Root | The only entity in an aggregate that external code references |
| Invariant | A business rule that must always be true |
| Command | A request to change state (intent-driven) |
| Event | A fact that happened in the past (immutable) |
| Policy | An automated reaction to events; connects contexts |
| Read Model | A projection of state optimized for queries (no invariants) |
| Value Object | Immutable object defined by attributes, not identity |
| CQRS | Command Query Responsibility Segregation (commands change state; queries read state) |
| Event Sourcing | Storing events as source of truth; state is derived by replay |

---
## Related Context Maps

**Upstream (External Dependencies):**
- **NATS** - Provides pub/sub, KV store, JetStream
- **Local Runtime** - Executes actors on this node
- **Event Store** - Persists cluster events (optional)

**Downstream (Consumers):**
- **Actor Runtime Context** - Migrates actors when shards move (reacts to ShardMigrated)
- **Monitoring Context** - Tracks health and events (subscribes to topology events)
- **Audit Context** - Records all topology changes (subscribes to all events)

---
## Quick Reference: Decision Trees

### Is a node healthy?
```
Node found? → Check status (Active|Draining|Failed)
  → Active/Draining? → YES
  → Failed? → NO
```
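
The health tree above translates into a short predicate. This is a minimal sketch: `NodeStatus` and `IsHealthy` are hypothetical names, and treating a node that is not found as unhealthy is an assumption, since the tree does not state that outcome.

```go
package main

// NodeStatus is a hypothetical status type for this sketch.
type NodeStatus string

const (
	StatusActive   NodeStatus = "Active"
	StatusDraining NodeStatus = "Draining"
	StatusFailed   NodeStatus = "Failed"
)

// IsHealthy follows the tree: the node must be found, and its status
// must be Active or Draining. Failed nodes are unhealthy.
func IsHealthy(nodes map[string]NodeStatus, id string) bool {
	status, found := nodes[id]
	if !found {
		return false // assumption: unknown nodes count as unhealthy
	}
	return status == StatusActive || status == StatusDraining
}
```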

### Should we rebalance?
```
Leader? → YES
Active nodes? → YES
Strategy.Rebalance() → returns new ShardMap
Validate invariants? → YES
Publish ShardMigrated events
```
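
The rebalance steps above can be sketched as one function that guards, delegates to a strategy, validates, and then derives events. Everything here is illustrative: the `Strategy` signature, the single-owner `ShardMap`, the `RoundRobin` strategy, and the `ShardMigrated` fields are assumptions, not the types defined in DOMAIN_MODEL.md. Validation is reduced to "every existing shard keeps an active owner".

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// ShardMap maps shard ID -> owning node ID (simplified to one owner).
type ShardMap map[int]string

// Strategy mirrors the Strategy.Rebalance() step; the signature is assumed.
type Strategy interface {
	Rebalance(current ShardMap, activeNodes []string) ShardMap
}

// RoundRobin is a toy strategy: shards in ID order, nodes in turn.
type RoundRobin struct{}

func (RoundRobin) Rebalance(current ShardMap, activeNodes []string) ShardMap {
	shards := make([]int, 0, len(current))
	for shard := range current {
		shards = append(shards, shard)
	}
	sort.Ints(shards)
	next := make(ShardMap, len(current))
	for i, shard := range shards {
		next[shard] = activeNodes[i%len(activeNodes)]
	}
	return next
}

// ShardMigrated is a hypothetical event payload.
type ShardMigrated struct {
	Shard    int
	From, To string
}

// Rebalance walks the decision tree and returns the events to publish.
func Rebalance(isLeader bool, activeNodes []string, current ShardMap, s Strategy) ([]ShardMigrated, error) {
	if !isLeader {
		return nil, errors.New("only the leader may rebalance")
	}
	if len(activeNodes) == 0 {
		return nil, errors.New("no active nodes to rebalance onto")
	}
	next := s.Rebalance(current, activeNodes)

	active := make(map[string]bool, len(activeNodes))
	for _, n := range activeNodes {
		active[n] = true
	}
	var events []ShardMigrated
	for shard, from := range current {
		to, ok := next[shard]
		if !ok || !active[to] {
			return nil, fmt.Errorf("shard %d would lose its active owner", shard)
		}
		if to != from {
			events = append(events, ShardMigrated{Shard: shard, From: from, To: to})
		}
	}
	// Sort so callers see a deterministic event order.
	sort.Slice(events, func(i, j int) bool { return events[i].Shard < events[j].Shard })
	return events, nil
}
```

Returning events instead of publishing them keeps the function free of I/O, so it can be unit tested without a message bus.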

### Can we assign shard to node?
```
Node exists? → YES
Status active? → YES
Replication < max? → YES
Add node to shard's replica list
```
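
The assignment tree is a chain of three guards followed by an append. This sketch uses hypothetical names throughout (`AssignShard`, a `nodes` map of node ID to "is active", and a `maxReplicas` parameter standing in for the replication limit). It returns a new slice so the caller's replica list is never modified in place.

```go
package main

import "errors"

// AssignShard adds nodeID to a shard's replica list if all three checks
// from the decision tree pass. nodes maps node ID -> whether it is active.
func AssignShard(nodes map[string]bool, replicas []string, nodeID string, maxReplicas int) ([]string, error) {
	active, exists := nodes[nodeID]
	if !exists {
		return replicas, errors.New("node does not exist")
	}
	if !active {
		return replicas, errors.New("node is not active")
	}
	if len(replicas) >= maxReplicas {
		return replicas, errors.New("replication limit reached")
	}
	// Copy before appending so the input slice is left untouched.
	updated := append(append([]string(nil), replicas...), nodeID)
	return updated, nil
}
```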

See [ARCHITECTURE.md](./ARCHITECTURE.md) for full decision trees.

---
## Testing Resources

**Test Coverage Map:**
- Unit tests: Commands, invariants, value objects
- Integration tests: Full scenarios (node join, node fail, rebalance)
- Chaos tests: Partitions, cascading failures, high churn

See [PATTERNS.md](./PATTERNS.md) **Pattern 6** for testing patterns and mocks.

---
## Common Questions

**Q: Why not use Raft for leader election?**
A: Lease-based election is simpler and sufficient for our use case. Raft would be safer but more complex. See DOMAIN_MODEL.md **Design Decisions**.
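
To show why lease-based election is the simpler option, here is a minimal in-memory sketch. It is not Aether's implementation: `Lease`, `LeaseStore`, and `TryAcquire` are hypothetical names, and a mutex stands in for whatever atomic compare-and-set the real backing store provides. A node becomes leader by acquiring a free or expired lease, stays leader by renewing before expiry, and each new holder starts a new term.

```go
package main

import (
	"sync"
	"time"
)

// Lease records who holds leadership, in which term, and until when.
type Lease struct {
	Holder    string
	Term      uint64
	ExpiresAt time.Time
}

// LeaseStore guards a single lease; the mutex stands in for an atomic
// compare-and-set in a shared store (an assumption of this sketch).
type LeaseStore struct {
	mu    sync.Mutex
	lease Lease
}

// TryAcquire grants the lease to node if it is free or expired, or renews
// it if node already holds it. It fails while another node's lease is live.
func (s *LeaseStore) TryAcquire(node string, now time.Time, ttl time.Duration) (Lease, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	live := s.lease.Holder != "" && now.Before(s.lease.ExpiresAt)
	if live && s.lease.Holder != node {
		return s.lease, false // someone else is still leader
	}
	if !live {
		s.lease.Term++ // a fresh acquisition starts a new term
		s.lease.Holder = node
	}
	s.lease.ExpiresAt = now.Add(ttl)
	return s.lease, true
}
```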

**Q: What if a leader fails mid-rebalance?**
A: New leader will detect incomplete rebalancing and may redo it. This is acceptable (idempotent). See ARCHITECTURE.md **Failure Scenarios**.

**Q: How many shards should we use?**
A: Default 1024 provides good granularity. Tune based on your cluster size. See ARCHITECTURE.md **Configuration & Tuning**.
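
For intuition on granularity: with 1024 shards, an 8-node cluster holds about 128 shards per node, so moving one shard shifts under 1% of a node's load. The sketch below shows one common way to map a key to a shard, hashing modulo the shard count. This mapping is an assumption for illustration; this index does not say how Aether assigns keys to shards, and `ShardFor` and `DefaultShardCount` are hypothetical names.

```go
package main

import "hash/fnv"

// DefaultShardCount matches the default quoted above.
const DefaultShardCount = 1024

// ShardFor maps a key to a shard ID in [0, shardCount) using FNV-1a.
// It returns 0 for a non-positive shardCount to avoid dividing by zero.
func ShardFor(key string, shardCount int) int {
	if shardCount <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(shardCount))
}
```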

**Q: Can actors be lost during rebalancing?**
A: No, if the application correctly implements actor migration. See DOMAIN_MODEL.md **Gaps**.

**Q: Is eventual consistency acceptable?**
A: Yes for topology (replicas lag leader by ~100ms). Leadership is strongly consistent (atomic operations). See DOMAIN_MODEL.md **Policies**.

---
## Implementation Checklist

- [ ] Read DOMAIN_MODEL.md Summary + Invariants
- [ ] Read REFACTORING_SUMMARY.md Overview
- [ ] Review PATTERNS.md for Phase 1
- [ ] Implement Phase 1 commands (JoinCluster, MarkNodeFailed)
- [ ] Add tests for Phase 1
- [ ] Code review
- [ ] Merge Phase 1
- [ ] Repeat for Phases 2-4

---
## Document Version History

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-01-12 | Initial domain model created |

---
## Contact & Questions

For questions about this domain model:
- **Domain modeling:** Refer to DOMAIN_MODEL.md Invariants & Aggregates sections
- **Implementation:** Refer to PATTERNS.md for code examples
- **Architecture:** Refer to ARCHITECTURE.md for system design
- **Refactoring plan:** Refer to REFACTORING_SUMMARY.md for priorities

---
## Additional Resources

- [Vision](../vision.md) - Product vision for Aether
- [Project Structure](../README.md) - How this repository is organized
- [Event Sourcing Guide](../event.go) - Event and EventStore interface
- [NATS Documentation](https://docs.nats.io) - NATS pub/sub and JetStream