Domain Model: Cluster Coordination

Summary

The Cluster Coordination context manages the distributed topology of actor nodes in an Aether cluster. Its core responsibility is to maintain consistency invariants: exactly one leader per term, all active shards assigned to at least one node, and no orphaned shards. It coordinates node discovery (via NATS heartbeats), leader election (lease-based), shard assignment (via consistent hashing), and rebalancing (when topology changes). The context enforces that only the leader can initiate rebalancing, and that node failures trigger shard reassignment to prevent actor orphaning.

Key insight: Cluster Coordination is not actor placement or routing (that's the application's responsibility via ShardManager). It owns the topology and leadership, enabling routing decisions by publishing shard assignments.


Invariants

These are the business rules that must never be violated:

Invariant 1: Single Leader Per Term

  • Rule: At any point in time, at most one node is the leader for the current leadership term.
  • Scope: LeadershipLease aggregate
  • Why: Multiple leaders (split-brain) lead to conflicting rebalancing decisions and inconsistent shard assignments.
  • Enforcement: LeaderElection enforces via NATS KV atomic operations (create/update with revision). Only one node can atomically claim the "leader" key.
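
A minimal sketch of the atomic claim using the nats.go KV API; the bucket name, key name, and package layout are assumptions, not the project's actual code:

```go
// Sketch: KV Create succeeds only if the key does not already exist,
// so exactly one contender can win the "leader" key per term.
package leaderclaim

import "github.com/nats-io/nats.go"

// tryClaimLeadership returns (true, revision) if this node won the race.
func tryClaimLeadership(js nats.JetStreamContext, nodeID string) (bool, uint64, error) {
	kv, err := js.KeyValue("aether-leadership") // assumed bucket name
	if err != nil {
		return false, 0, err
	}
	rev, err := kv.Create("leader", []byte(nodeID))
	if err != nil {
		// Simplification: treat any Create failure as "another node holds the key".
		return false, 0, nil
	}
	return true, rev, nil
}
```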

Invariant 2: All Active Shards Have Owner(s)

  • Rule: Every shard ID in [0, ShardCount) must be assigned to at least one active node if the cluster is healthy.
  • Scope: ShardAssignment aggregate
  • Why: Unassigned shards mean actors on those shards have no home; messages to them are orphaned.
  • Enforcement: LeaderElection enforces (only leader can assign). ClusterManager validates before applying assignments.

Invariant 3: Assigned Shards Exist on Healthy Nodes Only

  • Rule: A shard assignment to node N is only valid if N is in NodeStatusActive.
  • Scope: ShardAssignment + Cluster aggregates (coupled)
  • Why: Assigning shards to failed nodes means actors can't execute.
  • Enforcement: When node fails (NodeStatusFailed), leader rebalances shards off that node. handleNodeUpdate marks nodes failed after 90s heartbeat miss.

Invariant 4: Shard Assignments Stable During Leadership Lease

  • Rule: Shard assignments only change in response to LeaderElected or NodeFailed; they don't arbitrarily shift during a stable leadership term.
  • Scope: ShardAssignment + LeadershipLease (coupled)
  • Why: Frequent rebalancing causes thrashing and actor migration overhead.
  • Enforcement: rebalanceLoop (every 5 min) only runs if leader; triggerShardRebalancing only called on node changes (NodeJoined/Left/Failed).

Invariant 5: Leader Is an Active Node

  • Rule: If LeaderID is set, the node with that ID must exist in Cluster.nodes with status=Active.
  • Scope: Cluster + LeadershipLease (coupled)
  • Why: A failed leader cannot coordinate cluster decisions.
  • Enforcement: handleNodeUpdate marks nodes failed after timeout; leader renewal fails if node is marked failed. Split-brain risk: partition could allow multiple leaders, but lease expiration + atomic update mitigates.

Aggregates

Aggregate 1: Cluster (Root)

Invariants enforced:

  • Invariant 2: All active shards have owners
  • Invariant 3: Shards assigned only to healthy nodes
  • Invariant 4: Shard assignments stable during leadership lease
  • Invariant 5: Leader is an active node

Entities:

  • Cluster (root): Represents the distributed topology and orchestrates rebalancing
    • nodes: Map[NodeID → NodeInfo] - all known nodes, their status, load, capacity
    • shardMap: ShardMap - current shard-to-node assignments
    • hashRing: ConsistentHashRing - used to compute which node owns which shard
    • currentLeaderID: String - who is leading this term
    • term: uint64 - leadership term counter

Value Objects:

  • NodeInfo: ID, Address, Port, Status, Capacity, Load, LastSeen, Metadata, VMCount, ShardIDs
    • Represents a physical node in the cluster; treated as immutable and replaced via NodeUpdate commands rather than mutated in place
  • ShardMap: Version, Shards (map[ShardID → []NodeID]), Nodes (map[NodeID → NodeInfo]), UpdateTime
    • Snapshot of current shard topology; immutable (replaced, not mutated)
  • NodeStatus: Enum (Active, Draining, Failed)
    • Indicates health state of a node

Lifecycle:

  • Created when: ClusterManager is instantiated (Cluster exists as singleton during runtime)
  • Destroyed when: Cluster shuts down or node is permanently removed
  • Transitions:
    • NodeJoined → add node to nodes, add to hashRing, trigger rebalance (if leader)
    • NodeLeft → remove node from nodes, remove from hashRing, trigger rebalance (if leader)
    • NodeFailed (detected) → mark node as failed, trigger rebalance (if leader)
    • LeaderElected → update currentLeaderID, may trigger rebalance
    • ShardAssigned → update shardMap, increment version

Behavior Methods (not just getters/setters):

  • addNode(nodeInfo) → NodeJoined event + may trigger rebalance
  • removeNode(nodeID) → NodeLeft event + trigger rebalance
  • markNodeFailed(nodeID) → NodeFailed event + trigger rebalance
  • assignShards(shardMap) → ShardAssigned event (leader only)
  • rebalanceTopology() → ShardMigrated events (leader only)

Aggregate 2: LeadershipLease (Root)

Invariants enforced:

  • Invariant 1: Single leader per term
  • Invariant 5: Leader is an active node

Entities:

  • LeadershipLease (root): Represents the current leadership claim
    • leaderID: String - which node holds the lease
    • term: uint64 - monotonically increasing term number
    • expiresAt: Timestamp - when this lease expires (now + LeaderLeaseTimeout)
    • startedAt: Timestamp - when leader was elected

Value Objects:

  • None (all properties immutable; lease is replaced, not mutated)

Lifecycle:

  • Created when: A node wins election and creates the "leader" key in NATS KV
  • Destroyed when: Lease expires and is not renewed, or leader resigns
  • Transitions:
    • TryBecomeLeader → attempt atomic create; if the create fails, attempt to claim the existing lease only if it has expired
    • RenewLease (every 3s) → atomically update expiresAt to now + 10s
    • LeaseExpired (detected) → remove from KV, allow new election
    • NodeFailed (detected) → if failed node is leader, expiration will trigger new election

Behavior Methods:

  • tryAcquire(nodeID) → LeaderElected event (if succeeds)
  • renewLease(nodeID) → LeadershipRenewed event (internal, not exposed as command)
  • isExpired() → Boolean
  • isLeader(nodeID) → Boolean

Invariant enforcement mechanism:

  • Atomic operations in NATS KV: Only one node can successfully create "leader" key (or update with correct revision), ensuring single leader per term.
  • Lease expiration: If leader crashes without renewing, lease expires after 10s, allowing another node to claim it.
  • Revision-based updates: Update to lease must include correct revision (optimistic concurrency control), preventing stale leader from renewing.
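
A minimal sketch of the revision-based renewal path, assuming a JSON lease payload and the nats.go KeyValue API; the payload fields and key name are illustrative:

```go
package leasesketch

import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

type leasePayload struct {
	LeaderID  string    `json:"leader_id"`
	Term      uint64    `json:"term"`
	ExpiresAt time.Time `json:"expires_at"`
}

// renewLease extends the lease only if our last-known revision is still
// current; otherwise another node has taken over and we must step down.
func renewLease(kv nats.KeyValue, nodeID string, term, lastRev uint64, ttl time.Duration) (uint64, error) {
	payload, err := json.Marshal(leasePayload{
		LeaderID:  nodeID,
		Term:      term,
		ExpiresAt: time.Now().Add(ttl),
	})
	if err != nil {
		return 0, err
	}
	newRev, err := kv.Update("leader", payload, lastRev)
	if err != nil {
		return 0, fmt.Errorf("lost leadership: %w", err)
	}
	return newRev, nil
}
```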

Aggregate 3: ShardAssignment (Root)

Invariants enforced:

  • Invariant 2: All active shards have owners
  • Invariant 3: Shards assigned only to healthy nodes

Entities:

  • ShardAssignment (root): Maps shards to their owning nodes
    • version: uint64 - incremented on each change, enables version comparison for replication
    • assignments: Map[ShardID → []NodeID] - shard to primary+replica nodes
    • nodes: Map[NodeID → NodeInfo] - snapshot of active nodes at assignment time
    • updateTime: Timestamp

Value Objects:

  • None (structure is just data; immutability via replacement)

Lifecycle:

  • Created when: Cluster initializes (empty assignments)
  • Updated when: Leader calls rebalanceTopology() → new ShardAssignment created (old one replaced)
  • Destroyed when: Cluster shuts down

Behavior Methods:

  • assignShard(shardID, nodeList) → validates all nodes in nodeList are active
  • rebalanceFromTopology(topology, strategy) → calls strategy to compute new assignments
  • validateAssignments() → checks all shards assigned, all owners healthy
  • getAssignmentsForNode(nodeID) → []ShardID

Validation Rules:

  • All nodes in assignment must be in nodes map with status=Active
  • All shard IDs in [0, ShardCount) must appear in assignments (no orphans)
  • Replication factor respected (each shard has 1..ReplicationFactor owners)
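
A sketch of these checks with assumed types; the real aggregate would run something equivalent before accepting a new assignment map:

```go
package shardsketch

import "fmt"

type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusDraining
	NodeStatusFailed
)

type NodeInfo struct {
	ID     string
	Status NodeStatus
}

// validateAssignments checks shard coverage (Invariant 2), healthy owners
// (Invariant 3), and the replication factor before a map is accepted.
func validateAssignments(assignments map[int][]string, nodes map[string]*NodeInfo, shardCount, replicationFactor int) error {
	for shardID := 0; shardID < shardCount; shardID++ {
		owners, ok := assignments[shardID]
		if !ok || len(owners) == 0 {
			return fmt.Errorf("shard %d has no owner", shardID)
		}
		if len(owners) > replicationFactor {
			return fmt.Errorf("shard %d exceeds replication factor", shardID)
		}
		for _, nodeID := range owners {
			node, found := nodes[nodeID]
			if !found || node.Status != NodeStatusActive {
				return fmt.Errorf("shard %d assigned to unhealthy node %s", shardID, nodeID)
			}
		}
	}
	return nil
}
```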

Commands

Commands represent user or system intents to change the cluster state. Only aggregates handle commands.

Command 1: JoinCluster

  • Aggregate: Cluster
  • Actor: Node joining (or discovery service announcing)
  • Input: nodeID, address, port, capacity, metadata
  • Validates:
    • nodeID is not empty
    • capacity > 0
    • address is reachable (optional)
  • Invariants enforced: Invariant 2 (rebalance if needed)
  • Success: NodeJoined event published
  • Failure: DuplicateNodeError (node already in cluster), ValidationError

Command 2: ElectLeader

  • Aggregate: LeadershipLease
  • Actor: Node attempting election (triggered periodically)
  • Input: nodeID, currentTerm
  • Validates:
    • nodeID matches a current cluster member (in Cluster.nodes, status=Active)
    • Can attempt if no current leader OR lease is expired
  • Invariants enforced: Invariant 1, 5
  • Success: LeaderElected event published (if atomic create succeeds); LeadershipRenewed (if claim expired lease)
  • Failure: LeaderElectionFailed (lost the atomic-create race), NodeNotHealthy

Command 3: RenewLeadership

  • Aggregate: LeadershipLease
  • Actor: Current leader (triggered every 3s)
  • Input: nodeID, currentTerm
  • Validates:
    • nodeID is current leader
    • term matches current term
    • node status is Active (else fail and lose leadership)
  • Invariants enforced: Invariant 1, 5
  • Success: LeadershipRenewed (internal event, triggers heartbeat log entry)
  • Failure: LeadershipLost (node is no longer healthy or lost atomic update race)

Command 4: MarkNodeFailed

  • Aggregate: Cluster
  • Actor: System (monitoring service) or leader (if heartbeat misses)
  • Input: nodeID, reason
  • Validates:
    • nodeID exists in cluster
    • node is currently Active (don't re-fail already-failed nodes)
  • Invariants enforced: Invariant 2, 3, 5 (rebalance to move shards off failed node)
  • Success: NodeFailed event published; RebalanceTriggered (if leader)
  • Failure: NodeNotFound, NodeAlreadyFailed

Command 5: AssignShards

  • Aggregate: ShardAssignment (+ reads Cluster topology)
  • Actor: Leader (only leader can assign)
  • Input: nodeID (must be leader), newAssignments (Map[ShardID → []NodeID])
  • Validates:
    • nodeID is current leader
    • all nodes in assignments are Active
    • all shards in [0, ShardCount) are covered
    • replication factor respected
  • Invariants enforced: Invariant 2, 3 (assignment only valid if all nodes healthy)
  • Success: ShardAssigned event published with new ShardMap
  • Failure: NotLeader, InvalidAssignment (node not found), UnhealthyNode, IncompleteAssignment (missing shards)

Command 6: RebalanceShards

  • Aggregate: Cluster (orchestrates) + ShardAssignment (executes)
  • Actor: Leader (triggered by node changes or periodic check)
  • Input: nodeID (must be leader), strategy (optional placement strategy)
  • Validates:
    • nodeID is current leader
    • cluster has active nodes
  • Invariants enforced: Invariant 2 (all shards still assigned), Invariant 3 (only to healthy nodes)
  • Success: RebalancingCompleted event; zero or more ShardMigrated events (one per shard moved)
  • Failure: NotLeader, NoActiveNodes, RebalancingFailed (unexpected topology change mid-rebalance)

Events

Events represent facts that happened. They are published after successful command execution.

Event 1: NodeJoined

  • Triggered by: JoinCluster command
  • Aggregate: Cluster
  • Data: nodeID, address, port, capacity, metadata, timestamp
  • Consumed by:
    • Cluster (adds node to ring)
    • Policies (RebalancingTriggerPolicy)
  • Semantics: A new node entered the cluster and is ready to host actors
  • Immutability: Once published, never changes

Event 2: NodeDiscovered

  • Triggered by: NodeDiscovery announces node via NATS pub/sub (implicit)
  • Aggregate: Cluster (discovery feeds into cluster topology)
  • Data: nodeID, nodeInfo, timestamp
  • Consumed by: Cluster topology sync
  • Semantics: Node became visible to the cluster; may be new or rediscovered after network partition
  • Note: Implicit event; not explicitly commanded, but captured in domain language

Event 3: LeaderElected

  • Triggered by: ElectLeader command (atomic KV create succeeds) or ReclaimExpiredLease
  • Aggregate: LeadershipLease
  • Data: leaderID, term, expiresAt, startedAt, timestamp
  • Consumed by:
    • Cluster (updates currentLeaderID)
    • Policies (LeaderElectionCompletePolicy)
  • Semantics: A node has acquired leadership for the given term
  • Guarantee: At most one node can succeed in creating this event per term

Event 4: LeadershipLost

  • Triggered by: Lease expires (detected by monitorLeadership watcher) or RenewLeadership fails
  • Aggregate: LeadershipLease
  • Data: leaderID, term, reason (LeaseExpired, FailedToRenew, NodeFailed), timestamp
  • Consumed by:
    • Cluster (clears currentLeaderID)
    • Policies (trigger new election)
  • Semantics: The leader is no longer valid and coordination authority is vacant
  • Trigger: No renewal received for 10s, or atomic update fails

Event 5: LeadershipRenewed

  • Triggered by: RenewLeadership command (succeeds every 3s)
  • Aggregate: LeadershipLease
  • Data: leaderID, term, expiresAt, timestamp
  • Consumed by: Internal use (heartbeat signal); not published to other contexts
  • Semantics: Leader is alive and ready to coordinate
  • Frequency: Every 3s per leader

Event 6: ShardAssigned

  • Triggered by: AssignShards command or RebalanceShards command
  • Aggregate: ShardAssignment
  • Data: shardID, nodeIDs (primary + replicas), version, timestamp
  • Consumed by:
    • ShardManager (updates routing)
    • Policies (ShardOwnershipPolicy)
    • Other contexts (if they subscribe to shard topology changes)
  • Semantics: Shard N is now owned by these nodes (primary first)
  • Bulk event: Often published multiple times in one rebalance operation
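
A hypothetical shape for this event and how it might be published over NATS; the subject name and field names are assumptions, not the real wire format:

```go
package eventsketch

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

// ShardAssigned is a hypothetical event payload; the real format may differ.
type ShardAssigned struct {
	ShardID   int       `json:"shard_id"`
	NodeIDs   []string  `json:"node_ids"` // primary first, then replicas
	Version   uint64    `json:"version"`
	Timestamp time.Time `json:"timestamp"`
}

func publishShardAssigned(nc *nats.Conn, ev ShardAssigned) error {
	data, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	// Assumed subject; ShardManager and other consumers would subscribe to it.
	return nc.Publish("aether.cluster.shard.assigned", data)
}
```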

Event 7: NodeFailed

  • Triggered by: MarkNodeFailed command
  • Aggregate: Cluster
  • Data: nodeID, reason (HeartbeatTimeout, AdminMarked, etc.), timestamp
  • Consumed by:
    • Cluster (removes from active pool)
    • Policies (RebalancingTriggerPolicy, actor migration)
    • Other contexts (may need to relocate actors)
  • Semantics: Node is unresponsive and should be treated as offline
  • Detection: heartbeat miss after 90s, or explicit admin action

Event 8: NodeLeft

  • Triggered by: Node gracefully shuts down (announceNode(NodeLeft)) or MarkNodeFailed (for draining)
  • Aggregate: Cluster
  • Data: nodeID, reason (GracefulShutdown, AdminRemoved, etc.), timestamp
  • Consumed by: Policies (same as NodeFailed, triggers rebalance)
  • Semantics: Node is intentionally leaving and will not rejoin
  • Difference from NodeFailed: Intent signal; failed nodes might rejoin after network partition heals

Event 9: ShardMigrated

  • Triggered by: RebalanceShards command (one event per shard reassigned)
  • Aggregate: Cluster
  • Data: shardID, fromNodes (old owners), toNodes (new owners), timestamp
  • Consumed by:
    • Local runtime (via ShardManager; triggers actor migration)
    • Other contexts (if they track actor locations)
  • Semantics: A shard's ownership changed; actors on that shard may need to migrate
  • Migration strategy: Application owns how to move actors (via ActorMigration); cluster just signals the change

Event 10: RebalancingTriggered

  • Triggered by: RebalanceShards command (start)
  • Aggregate: Cluster
  • Data: leaderID, reason (NodeJoined, NodeFailed, Manual), timestamp
  • Consumed by: Monitoring/debugging
  • Semantics: Leader has initiated a rebalancing cycle
  • Note: Informational; subsequent ShardMigrated events describe the actual changes

Event 11: RebalancingCompleted

  • Triggered by: RebalanceShards command (finish)
  • Aggregate: Cluster
  • Data: leaderID, completedAt, migrationsCount, timestamp
  • Consumed by: Monitoring/debugging, other contexts may wait for this before proceeding
  • Semantics: All shard reassignments have been published; it does not mean actor migrations have finished
  • Note: ShardMigrated is the signal to move actors; this is the coordination signal

Policies

Policies are automated reactions to events. They connect events to commands across aggregates and contexts.

Policy 1: Single Leader Policy

  • Trigger: When LeadershipLost event
  • Action: Any node can attempt ElectLeader command
  • Context: Only one will succeed due to atomic NATS KV operation
  • Rationale: Ensure leadership is re-established quickly after vacancy
  • Implementation: electionLoop in LeaderElection runs every 2s, calls tryBecomeLeader if not leader

Policy 2: Lease Renewal Policy

  • Trigger: Periodic timer (every 3s)
  • Action: If IsLeader, execute RenewLeadership command
  • Context: Heartbeat mechanism to prove leader is alive
  • Rationale: Detect leader failure via lease expiration after 10s inactivity
  • Implementation: leaseRenewalLoop in LeaderElection; failure triggers loseLeadership()
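
A sketch of the loop's shape, using stand-in function fields rather than the real LeaderElection type:

```go
package renewsketch

import (
	"context"
	"time"
)

// leaderElection holds stand-in hooks for the real type's IsLeader,
// renewLease, and loseLeadership behavior.
type leaderElection struct {
	isLeader       func() bool
	renewLease     func() error
	loseLeadership func()
}

// leaseRenewalLoop ticks every 3s; a failed renewal means we step down and
// let the election policy pick a new leader.
func (le *leaderElection) leaseRenewalLoop(ctx context.Context) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !le.isLeader() {
				continue
			}
			if err := le.renewLease(); err != nil {
				le.loseLeadership()
			}
		}
	}
}
```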

Policy 3: Lease Expiration Policy

  • Trigger: When LeadershipLease.expiresAt < now (detected by monitorLeadership watcher)
  • Action: Clear currentLeader, publish LeadershipLost, trigger SingleLeaderPolicy
  • Context: Automatic failover when leader stops renewing
  • Rationale: Prevent stale leaders from coordinating during network partitions
  • Implementation: monitorLeadership watches "leader" KV key; if deleted or expired, calls handleLeadershipUpdate

Policy 4: Node Heartbeat Policy

  • Trigger: Periodic timer (every 30s) - NodeDiscovery announces
  • Action: Publish node status via NATS "aether.discovery" subject
  • Context: Membership discovery; all nodes broadcast presence
  • Rationale: Other nodes learn topology via heartbeats; leader detects failures via absence
  • Implementation: NodeDiscovery.Start() runs heartbeat ticker

Policy 5: Node Failure Detection Policy

  • Trigger: When NodeUpdate received with LastSeen > 90s ago
  • Action: Mark node as NodeStatusFailed; if leader, trigger RebalanceShards
  • Context: Eventual failure detection (passive, via heartbeat miss)
  • Rationale: Failed nodes may still hold shard assignments; rebalance moves shards to healthy nodes
  • Implementation: handleNodeUpdate checks LastSeen and marks nodes failed; checkNodeHealth periodic check
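
A minimal sketch of the detection pass under assumed types; the real handleNodeUpdate/checkNodeHealth differ in detail:

```go
package detectsketch

import "time"

type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusFailed
)

type NodeInfo struct {
	ID       string
	Status   NodeStatus
	LastSeen time.Time
}

const heartbeatTimeout = 90 * time.Second // three missed 30s heartbeats

// checkNodeHealth marks stale nodes failed; only the leader reacts by rebalancing.
func checkNodeHealth(nodes map[string]*NodeInfo, isLeader bool, markFailed func(nodeID string), triggerRebalance func(reason string)) {
	now := time.Now()
	for id, node := range nodes {
		if node.Status == NodeStatusActive && now.Sub(node.LastSeen) > heartbeatTimeout {
			markFailed(id)
			if isLeader {
				triggerRebalance("node failed: " + id)
			}
		}
	}
}
```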

Policy 6: Shard Rebalancing Trigger Policy

  • Trigger: When NodeJoined, NodeLeft, or NodeFailed event
  • Action: If leader, execute RebalanceShards command
  • Context: Topology change → redistribute actors
  • Rationale: New node should get load; failed node's shards must be reassigned
  • Implementation: handleNodeUpdate calls triggerShardRebalancing if leader

Policy 7: Shard Ownership Enforcement Policy

  • Trigger: When ShardAssigned event
  • Action: Update local ShardMap; nodes use this for actor routing
  • Context: All nodes must agree on shard ownership for routing consistency
  • Rationale: Single source of truth (published by leader) prevents routing conflicts
  • Implementation: ClusterManager receives ShardAssigned via NATS; updates shardMap

Policy 8: Shard Coverage Policy

  • Trigger: Periodic check (every 5 min) or after NodeFailed
  • Action: Validate all shards in [0, ShardCount) are assigned; if any missing, trigger RebalanceShards
  • Context: Safety check to prevent shard orphaning
  • Rationale: Ensure no actor can be born on an unassigned shard
  • Implementation: rebalanceLoop calls triggerShardRebalancing with reason "periodic rebalance check"

Policy 9: Leader-Only Rebalancing Policy

  • Trigger: RebalanceShards command
  • Action: Validate nodeID is currentLeader before executing
  • Context: Only leader can initiate topology changes
  • Rationale: Prevent cascading rebalancing from multiple nodes; single coordinator
  • Implementation: triggerShardRebalancing checks IsLeader() at start

Policy 10: Graceful Shutdown Policy

  • Trigger: NodeDiscovery.Stop() called
  • Action: Publish NodeLeft event
  • Context: Signal that this node is intentionally leaving
  • Rationale: Other nodes should rebalance shards away from this node; different from failure
  • Implementation: Stop() calls announceNode(NodeLeft) before shutting down

Read Models

Read models project state for queries. They have no invariants and can be eventually consistent.

Read Model 1: GetClusterTopology

  • Purpose: What nodes are currently in the cluster?
  • Data:
    • nodes: []NodeInfo (filtered to status=Active only)
    • timestamp: When snapshot was taken
  • Source: Cluster.nodes, filtered by status != Failed
  • Updated: After NodeJoined, NodeLeft, NodeFailed events
  • Queryable by: nodeID, status, capacity, load
  • Eventual consistency: Replica nodes lag leader by a few heartbeats

Read Model 2: GetLeader

  • Purpose: Who is the current leader?
  • Data:
    • leaderID: Current leader node ID, or null if no leader
    • term: Leadership term number
    • expiresAt: When current leadership lease expires
    • confidence: "high" (just renewed), "medium" (recent), "low" (about to expire)
  • Source: LeadershipLease
  • Updated: After LeaderElected, LeadershipRenewed, LeadershipLost events
  • Queryable by: leaderID, term, expiration time
  • Eventual consistency: Non-leader nodes lag by up to 10s (lease timeout)

Read Model 3: GetShardAssignments

  • Purpose: Where does each shard live?
  • Data:
    • shardID: Shard number
    • primaryNode: Node ID (shardMap.Shards[shardID][0])
    • replicaNodes: []NodeID (shardMap.Shards[shardID][1:])
    • version: ShardMap version (for optimistic concurrency)
  • Source: Cluster.shardMap
  • Updated: After ShardAssigned, ShardMigrated events
  • Queryable by: shardID, nodeID (which shards does node own?)
  • Eventual consistency: Replicas lag leader by one NATS publish; consistent within a term
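
For illustration, a sketch of the "which shards does this node own?" query against an assumed Shards map (shard ID → owner node IDs, primary first):

```go
package readsketch

// shardsOwnedBy projects the shard map into a per-node view; it is a pure
// read model with no invariants of its own.
func shardsOwnedBy(shards map[int][]string, nodeID string) []int {
	var owned []int
	for shardID, owners := range shards {
		for _, id := range owners {
			if id == nodeID {
				owned = append(owned, shardID)
				break
			}
		}
	}
	return owned
}
```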

Read Model 4: GetNodeHealth

  • Purpose: Is a given node healthy?
  • Data:
    • nodeID: Node identifier
    • status: Active | Draining | Failed
    • lastSeen: Last heartbeat timestamp
    • downForSeconds: (now - lastSeen)
  • Source: Cluster.nodes[nodeID]
  • Updated: After NodeJoined, NodeUpdated, NodeFailed events
  • Queryable by: nodeID, status threshold (e.g., "give me all failed nodes")
  • Eventual consistency: Non-leader nodes lag by 30s (heartbeat interval)

Read Model 5: GetRebalancingStatus

  • Purpose: Is rebalancing in progress? How many shards moved?
  • Data:
    • isRebalancing: Boolean
    • startedAt: Timestamp
    • reason: "node_joined" | "node_failed" | "periodic" | "manual"
    • completedCount: Number of shards finished
    • totalCount: Total shards to move
  • Source: RebalancingTriggered, ShardMigrated, RebalancingCompleted events
  • Updated: On rebalancing events
  • Queryable by: current status, started within N seconds
  • Eventual consistency: Replicas lag by one NATS publish

Value Objects

Value Object 1: NodeInfo

Represents a physical node in the cluster.

Fields:

  • ID: string - unique identifier
  • Address: string - IP or hostname
  • Port: int - NATS port
  • Status: NodeStatus enum (Active, Draining, Failed)
  • Capacity: float64 - max load capacity
  • Load: float64 - current load
  • LastSeen: time.Time - last heartbeat
  • Timestamp: time.Time - when created/updated
  • Metadata: map[string]string - arbitrary tags (region, version, etc.)
  • IsLeader: bool - is this the leader?
  • VMCount: int - how many actors on this node
  • ShardIDs: []int - which shards are assigned

Equality: Two NodeInfos are equal if all fields match (value equality; for clustering purposes nodes are keyed by ID, and instances are treated as immutable)

Validation:

  • ID non-empty
  • Capacity > 0
  • Status in {Active, Draining, Failed}
  • Port in valid range [1, 65535]
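
These rules might look like the following sketch; the field names are assumed from the description above and the types are stand-ins:

```go
package nodesketch

import (
	"errors"
	"fmt"
)

type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusDraining
	NodeStatusFailed
)

type NodeInfo struct {
	ID       string
	Port     int
	Status   NodeStatus
	Capacity float64
}

// validate applies the NodeInfo rules: non-empty ID, positive capacity,
// valid port range, and a known status.
func (n NodeInfo) validate() error {
	if n.ID == "" {
		return errors.New("node ID must be non-empty")
	}
	if n.Capacity <= 0 {
		return errors.New("capacity must be positive")
	}
	if n.Port < 1 || n.Port > 65535 {
		return fmt.Errorf("port %d out of range", n.Port)
	}
	switch n.Status {
	case NodeStatusActive, NodeStatusDraining, NodeStatusFailed:
		return nil
	default:
		return fmt.Errorf("unknown status %v", n.Status)
	}
}
```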

Value Object 2: ShardMap

Represents the current shard-to-node assignment snapshot.

Fields:

  • Version: uint64 - incremented on each change; used for optimistic concurrency
  • Shards: Map[ShardID → []NodeID] - shard to [primary, replica1, replica2, ...]
  • Nodes: Map[NodeID → NodeInfo] - snapshot of nodes known at assignment time
  • UpdateTime: time.Time - when created

Equality: Two ShardMaps are equal if Version and Shards are equal (Nodes is metadata)

Validation:

  • All shard IDs in [0, ShardCount)
  • All node IDs in Shards exist in Nodes
  • All nodes in Nodes have status=Active
  • Replication factor respected (1 ≤ len(Shards[sid]) ≤ ReplicationFactor)

Immutability: ShardMap is never mutated; rebalancing creates a new ShardMap


Value Object 3: LeadershipLease

Represents a leader's claim on coordination authority.

Fields:

  • LeaderID: string - node ID holding the lease
  • Term: uint64 - monotonically increasing term number
  • ExpiresAt: time.Time - when lease is no longer valid
  • StartedAt: time.Time - when leader was elected

Equality: Two leases are equal if LeaderID, Term, and ExpiresAt match

Validation:

  • LeaderID non-empty
  • Term ≥ 0
  • ExpiresAt > StartedAt
  • ExpiresAt - StartedAt == LeaderLeaseTimeout

Lifecycle:

  • Created: node wins election
  • Renewed: every 3s, ExpiresAt extended
  • Expired: if ExpiresAt < now and not renewed
  • Replaced: next term when new leader elected

Value Object 4: Term

Represents a leadership term (could be extracted for clarity).

Fields:

  • Number: uint64 - term counter

Semantics: Monotonically increasing; each new leader gets a higher term. Used to detect stale messages.


Code Analysis

Intended vs Actual: ClusterManager

Intended (from Domain Model):

  • Root aggregate owning Cluster topology
  • Enforces invariants: shard coverage, healthy node assignments, rebalancing triggers
  • Commands: JoinCluster, MarkNodeFailed, RebalanceShards
  • Events: NodeJoined, NodeFailed, ShardAssigned, ShardMigrated

Actual (from /cluster/manager.go):

  • Partially aggregate-like: owns nodes, shardMap, hashRing
  • Lacks explicit command methods: has handleClusterMessage() but not named commands like JoinCluster()
  • Lacks explicit event publishing: updates state but doesn't publish domain events
  • Invariant enforcement scattered: node failure detection in handleNodeUpdate(), but no central validation
  • Missing behavior: shard assignment logic in ShardManager, not in Cluster aggregate

Misalignment:

  1. Anemic aggregate: ClusterManager reads/writes state but doesn't enforce invariants or publish events
  2. Responsibility split: Cluster topology (Manager) vs shard assignment (ShardManager) vs leadership (LeaderElection) are not unified under one aggregate root
  3. No explicit commands: Node updates handled via generic message dispatcher, not domain-language commands
  4. No event sourcing: State changes don't produce events

Gaps:

  • No JoinCluster command handler
  • No MarkNodeFailed command handler (only handleNodeUpdate which detects failures)
  • No explicit ShardAssigned/ShardMigrated events
  • Rebalancing triggers exist (triggerShardRebalancing) but not as domain commands

Intended vs Actual: LeaderElection

Intended (from Domain Model):

  • Root aggregate owning LeadershipLease invariant (single leader per term)
  • Commands: ElectLeader, RenewLeadership
  • Events: LeaderElected, LeadershipLost, LeadershipRenewed

Actual (from /cluster/leader.go):

  • Correctly implements lease-based election with NATS KV
  • Enforces single leader via atomic operations (create, update with revision)
  • Has implicit command pattern (tryBecomeLeader, renewLease, resignLeadership)
  • Has callbacks for leadership change, but no explicit event publishing

Alignment:

  • Atomic operations correctly enforce Invariant 1 (single leader)
  • Lease renewal every 3s enforces lease validity
  • Lease expiration detected via watcher
  • Leadership transitions (elected, lost) well-modeled

Gaps:

  • Events not explicitly published; callbacks used instead
  • No event sourcing (events should be recorded in event store, not just callbacks)
  • No term-based validation (could reject stale messages with old term)
  • Could be more explicit about LeaderElected event vs just callback

Intended vs Actual: ConsistentHashRing

Intended (from Domain Model):

  • Used by ShardAssignment to compute which node owns which shard
  • Policy: shards assigned via consistent hashing
  • Minimizes reshuffling on node join/leave

Actual (from /cluster/hashring.go):

  • Correctly implements consistent hash ring with virtual nodes
  • AddNode/RemoveNode operations are clean
  • GetNode(key) returns responsible node; used for actor placement

Alignment:

  • Good separation of concerns (ring is utility, not aggregate)
  • Virtual nodes (150 per node) reduce reshuffling on node change
  • Immutable ring structure (recreated on changes)

Gaps:

  • Not actively used by ShardAssignment (ShardManager has its own hash logic)
  • Could be used by RebalanceShards policy to compute initial assignments
  • Currently more of a utility than a policy
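
For reference, a compact ring with virtual nodes; the hash function and structure are illustrative, not the project's implementation, though the 150-replica count mirrors the description above:

```go
package ringsketch

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 150

type Ring struct {
	hashes []uint32          // sorted virtual-node hashes
	owner  map[uint32]string // virtual-node hash -> physical node ID
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodeIDs []string) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, id := range nodeIDs {
		for v := 0; v < virtualNodes; v++ {
			h := hashKey(fmt.Sprintf("%s#%d", id, v))
			r.hashes = append(r.hashes, h)
			r.owner[h] = id
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// GetNode returns the node responsible for a key: the first virtual node
// clockwise from the key's hash.
func (r *Ring) GetNode(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owner[r.hashes[i]]
}
```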

Intended vs Actual: ShardManager

Intended (from Domain Model):

  • ShardAssignment aggregate managing shard-to-node mappings
  • Commands: AssignShard, RebalanceShards (via PlacementStrategy)
  • Enforces invariants: all shards assigned, only to healthy nodes
  • Emits ShardAssigned events

Actual (from /cluster/shard.go):

  • Owns ShardMap, but like ClusterManager, is more of a data holder than aggregate
  • Has methods: AssignShard, RebalanceShards (delegates to PlacementStrategy)
  • Lacks invariant validation (doesn't check if nodes are healthy)
  • Lacks event publishing

Alignment:

  • PlacementStrategy pattern allows different algorithms (good design)
  • ConsistentHashPlacement exists but is stubbed

Gaps:

  • ShardManager.RebalanceShards not integrated with ClusterManager's decision to rebalance
  • No event publishing on shard changes
  • Invariant validation needed: validate nodes in assignments are healthy

Intended vs Actual: NodeDiscovery

Intended (from Domain Model):

  • Detects nodes via NATS heartbeats
  • Publishes NodeJoined, NodeUpdated, NodeLeft events via announceNode
  • Triggers policies (node failure detection, rebalancing)

Actual (from /cluster/discovery.go):

  • Heartbeats every 30s via announceNode
  • Subscribes to "aether.discovery" channel
  • Publishes NodeUpdate messages, not domain events

Alignment:

  • The heartbeat mechanism is sound; failures are detected via the 90s timeout in ClusterManager
  • Message-based communication works as an event bus

Gaps:

  • NodeUpdate is not a domain event; should publish NodeJoined, NodeUpdated, NodeLeft as explicit events
  • Could be clearer about lifecycle: Start announces NodeJoined, Stop announces NodeLeft

Intended vs Actual: DistributedVM

Intended (from Domain Model):

  • Orchestrates all cluster components (discovery, election, coordination, sharding)
  • Not itself an aggregate; more of a façade/orchestrator

Actual (from /cluster/distributed.go):

  • Correctly orchestrates: discovery + cluster manager + sharding + local runtime
  • DistributedVMRegistry provides VMRegistry interface to ClusterManager
  • Good separation: doesn't force topology decisions on runtime

Alignment:

  • Architecture clean; each component has clear responsibility
  • Decoupling via interfaces (Runtime, VirtualMachine, VMProvider) is good

Gaps:

  • No explicit orchestration logic (the Start method appears incomplete; only the first 100 lines were reviewed)
  • Could coordinate startup order more explicitly

Refactoring Backlog

Refactoring 1: Extract Cluster Aggregate from ClusterManager

Current: ClusterManager is anemic; it only stores state
Target: ClusterManager becomes a true aggregate root enforcing invariants

Steps:

  1. Add explicit command methods to ClusterManager:
    • JoinCluster(nodeInfo NodeInfo) error
    • MarkNodeFailed(nodeID string) error
    • AssignShards(shardMap ShardMap) error
    • RebalanceTopology(reason string) error
  2. Each command:
    • Validates preconditions
    • Calls aggregate behavior (private methods)
    • Publishes events
    • Returns result
  3. Add event publishing:
    • Create EventPublisher interface in ClusterManager
    • Publish NodeJoined, NodeFailed, ShardAssigned, ShardMigrated events
    • Events captured in event store (optional, or via NATS pub/sub)

Impact: Medium - changes ClusterManager interface but not external APIs yet
Priority: High - unblocks event-driven integration with other contexts
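
A sketch of what steps 1-3 could look like for JoinCluster; the internal fields (mu, nodes, hashRing, events) and the EventPublisher interface are assumptions about the refactored shape, not the current code:

```go
package clustersketch

import (
	"fmt"
	"sync"
	"time"
)

// EventPublisher is hypothetical; the real project would publish via NATS or an event store.
type EventPublisher interface {
	Publish(subject string, event any) error
}

type NodeInfo struct {
	ID       string
	Address  string
	Capacity float64
}

type NodeJoined struct {
	NodeID    string
	Address   string
	Capacity  float64
	Timestamp time.Time
}

// ClusterManager here is a stand-in with only the fields this sketch needs.
type ClusterManager struct {
	mu       sync.Mutex
	nodes    map[string]*NodeInfo
	hashRing interface{ AddNode(id string) }
	events   EventPublisher
}

// JoinCluster is the explicit command: validate, apply, publish.
func (cm *ClusterManager) JoinCluster(node NodeInfo) error {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	if node.ID == "" || node.Capacity <= 0 {
		return fmt.Errorf("invalid node info")
	}
	if _, exists := cm.nodes[node.ID]; exists {
		return fmt.Errorf("duplicate node %s", node.ID)
	}

	cm.nodes[node.ID] = &node
	cm.hashRing.AddNode(node.ID)

	// Rebalancing is triggered separately by policy if this node is the leader.
	return cm.events.Publish("aether.cluster.node.joined", NodeJoined{
		NodeID:    node.ID,
		Address:   node.Address,
		Capacity:  node.Capacity,
		Timestamp: time.Now(),
	})
}
```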


Refactoring 2: Extract ShardAssignment Commands from RebalanceShards

Current: ShardManager.RebalanceShards delegates to PlacementStrategy; no validation of healthy nodes
Target: ShardAssignment commands validate invariants

Steps:

  1. Add to ShardManager:
    • AssignShards(assignments map[int][]string, nodes map[string]*NodeInfo) error
      • Validates: all nodes exist and are Active
      • Validates: all shards in [0, ShardCount) assigned
      • Validates: replication factor respected
    • ValidateAssignments() error
  2. Move shard validation from coordinator to ShardManager
  3. Publish ShardAssigned events on successful assignment
  4. Update ClusterManager to call ShardManager.AssignShards instead of directly mutating ShardMap

Impact: Medium - clarifies shard aggregate, adds validation
Priority: High - prevents invalid shard assignments


Refactoring 3: Publish Domain Events from LeaderElection

Current: LeaderElection uses callbacks; no event sourcing
Target: Explicit event publishing for leader changes

Steps:

  1. Add EventPublisher interface to LeaderElection
  2. In becomeLeader: publish LeaderElected event
  3. In loseLeadership: publish LeadershipLost event
  4. Optional: publish LeadershipRenewed on each renewal (for audit trail)
  5. Events include: leaderID, term, expiresAt, timestamp
  6. Consumers subscribe via NATS and react (no longer callbacks)

Impact: Medium - changes LeaderElection interface
Priority: Medium - improves observability and enables event sourcing


Refactoring 4: Unify Node Failure Detection and Rebalancing

Current: Node failure is detected in handleNodeUpdate (90s timeout) plus periodic checkNodeHealth; rebalancing triggers are spread across multiple methods
Target: Explicit MarkNodeFailed command, single rebalancing trigger

Steps:

  1. Create explicit MarkNodeFailed command handler
  2. Move node failure detection logic to ClusterManager.markNodeFailed()
  3. Consolidate node failure checks (remove duplicate in checkNodeHealth)
  4. Trigger rebalancing only from MarkNodeFailed, not scattered
  5. Add RebalancingTriggered event before starting rebalance

Impact: Low - refactoring existing logic, not new behavior
Priority: Medium - improves clarity


Refactoring 5: Implement PlacementStrategy for Rebalancing

Current: ConsistentHashPlacement.RebalanceShards is stubbed
Target: Real rebalancing logic using consistent hashing

Steps:

  1. Implement ConsistentHashPlacement.RebalanceShards:
    • Input: current ShardMap, updated nodes (may have added/removed)
    • Output: new ShardMap with shards redistributed via consistent hash
    • Minimize movement: use virtual nodes to keep most shards in place
  2. Add RebalancingStrategy interface if other strategies needed (e.g., load-aware)
  3. Test: verify adding/removing node only reshuffles ~1/N shards

Impact: Medium - core rebalancing logic, affects all topology changes
Priority: High - currently rebalancing doesn't actually redistribute shards
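
A sketch of how step 1 could compute assignments from a consistent-hash router; ShardRouter and the naive replica selection are assumptions for illustration:

```go
package placementsketch

import "fmt"

// ShardRouter abstracts the consistent-hash ring (see the earlier ring sketch).
type ShardRouter interface {
	GetNode(key string) string
}

// rebalanceShards recomputes the full shard map deterministically; only shards
// whose nearest virtual node changed will actually move.
func rebalanceShards(ring ShardRouter, shardCount, replicationFactor int, activeNodes []string) map[int][]string {
	assignments := make(map[int][]string, shardCount)
	for shardID := 0; shardID < shardCount; shardID++ {
		primary := ring.GetNode(fmt.Sprintf("shard-%d", shardID))
		owners := []string{primary}
		// Naive replica choice: fill from the active-node list, skipping the primary.
		for _, n := range activeNodes {
			if len(owners) >= replicationFactor {
				break
			}
			if n != primary {
				owners = append(owners, n)
			}
		}
		assignments[shardID] = owners
	}
	return assignments
}
```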


Refactoring 6: Add Node Health Check Endpoint

Current: No way to query node health directly
Target: Read model for GetNodeHealth

Steps:

  1. Add method to ClusterManager: GetNodeHealth(nodeID string) NodeHealthStatus
  2. Return: status, lastSeen, downForSeconds
  3. Expose via NATS request/reply (if distributed query needed)
  4. Test: verify timeout logic

Impact: Low - new query method, no state changes
Priority: Low - nice to have for monitoring


Refactoring 7: Add Shard Migration Tracking

Current: ShardMigrated event published, but no tracking of migration progress
Target: ActorMigration status tracking and completion callback

Steps:

  1. Add MigrationTracker in cluster package
  2. On ShardMigrated event: create migration record (pending)
  3. Application reports migration progress (in_progress, completed, failed)
  4. On completion: remove from tracker
  5. Rebalancing can wait for migrations to complete before declaring rebalance done

Impact: High - affects how rebalancing coordinates with the application
Priority: Medium - improves robustness (don't rebalance while migrations are in flight)


Testing Strategy

Unit Tests

LeaderElection invariant tests:

  • Only one node can successfully create "leader" key → test atomic create succeeds once, fails second time
  • Lease expiration triggers new election → create expired lease, verify election succeeds
  • Lease renewal extends expiry → create lease, renew, verify new expiry is ~10s from now
  • Stale leader can't renew → mark node failed, verify renewal fails
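
A sketch of the first test, using a tiny in-memory stand-in (memKV is hypothetical) that mimics the "create fails if the key exists" semantics of NATS KV:

```go
package cluster_test

import (
	"errors"
	"testing"
)

// memKV is a minimal test double with atomic-create semantics.
type memKV struct{ data map[string][]byte }

func newMemKV() *memKV { return &memKV{data: map[string][]byte{}} }

func (kv *memKV) Create(key string, value []byte) (uint64, error) {
	if _, exists := kv.data[key]; exists {
		return 0, errors.New("key exists")
	}
	kv.data[key] = value
	return 1, nil
}

func TestOnlyOneNodeCanClaimLeaderKey(t *testing.T) {
	kv := newMemKV()
	if _, err := kv.Create("leader", []byte("node-1")); err != nil {
		t.Fatalf("first claim should succeed: %v", err)
	}
	if _, err := kv.Create("leader", []byte("node-2")); err == nil {
		t.Fatal("second claim must fail while the key exists")
	}
}
```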

Cluster topology invariant tests:

  • NodeJoined adds to hashRing → call addNode, verify GetNode routes consistently
  • NodeFailed triggers rebalance → call markNodeFailed, verify rebalance triggered
  • Shard coverage validated → rebalance with 100 nodes, 1024 shards, verify all shards assigned
  • Only healthy nodes get shards → assign to failed node, verify rejected

ShardManager invariant tests:

  • AssignShards validates node health → assign to failed node, verify error
  • RebalanceShards covers all shards → simulate topology change, verify no orphans
  • Virtual nodes minimize reshuffling → add node, verify < 1/N shards move

Integration Tests

Single leader election:

  • Create 3 cluster nodes
  • Verify exactly one becomes leader
  • Stop leader
  • Verify new leader elected within 10s
  • Test: leadership term increments

Node failure and recovery:

  • Create 5-node cluster with 100 shards
  • Mark node-2 failed
  • Verify shards reassigned from node-2 to others
  • Verify node-3 doesn't become unreasonably overloaded
  • Restart node-2
  • Verify shards rebalanced back

Graceful shutdown:

  • Create 3-node cluster
  • Gracefully stop node-1 (announces NodeLeft)
  • Verify no 90s timeout; rebalancing happens immediately
  • Compare to failure case (90s delay)

Split-brain recovery:

  • Create 3-node cluster: [A(leader), B, C]
  • Partition network: A isolated, B+C connected
  • Verify A loses leadership after 10s
  • Verify B or C becomes leader
  • Heal partition
  • Verify single leader, no conflicts (A didn't try to be leader again)

Rebalancing under load:

  • Create 5-node cluster, 100 shards, with actors running
  • Add node-6
  • Verify actors migrated off other nodes to node-6
  • No actors are orphaned (all still reachable)
  • Measure: reshuffled < 1/5 of shards

Chaos Testing

  • Leader failure mid-rebalance → verify rebalancing resumed by new leader
  • Network partition (leader isolated) → verify quorum (or lease) ensures no split-brain
  • Cascading failures → 5 nodes, fail 3 at once, verify cluster stabilizes
  • High churn → nodes join/leave rapidly, verify topology converges

Boundary Conditions and Limitations

Design Decisions

Why lease-based election instead of Raft?

  • Simpler to implement and reason about
  • Detect failure in 10s (acceptable for coordination)
  • Risk: split-brain if a network partition persists > 10s and both partitions have nodes (mitigation: lease renewal goes through NATS KV, so only the partition that retains the NATS connection can hold the lease)

Why leader-only rebalancing?

  • Prevent cascading rebalancing decisions
  • Single source of truth (leader decides topology)
  • Risk: leader bottleneck if rebalancing is expensive (mitigation: the leader can delegate the computation to a placement strategy instead of computing assignments itself)

Why consistent hashing instead of load-balancing?

  • Minimize shard movement on topology change (good for actor locality)
  • Deterministic without central state (nodes can independently compute assignments)
  • Risk: load imbalance if actors heavily skewed (mitigation: application can use custom PlacementStrategy)

Why 90s failure detection timeout?

  • 3 heartbeats missed (30s * 3) before declaring failure
  • Allows for some network jitter without false positives
  • Risk: slow failure detection (mitigation: application can force MarkNodeFailed if it detects failure faster)

Assumptions

  • NATS cluster is available: If NATS is down, cluster can't communicate (no failover without NATS)
  • Clocks are reasonably synchronized: Lease expiration depends on wall clock; major clock skew can break election
  • Network partitions are rare: Split-brain only possible if partition > 10s and leader isolated
  • Rebalancing is not time-critical: 5-min periodic check is default; no SLA on shard assignment latency

Known Gaps

  1. No quorum-based election: Single leader with lease; could add quorum for stronger consistency (Raft-like)
  2. No actor migration semantics: Who actually moves actors? Cluster signals ShardMigrated, but application must handle
  3. No topology versioning: ShardMap has version, but no way to detect if a node has an outdated topology
  4. No leader handoff during rebalancing: If leader fails mid-rebalance, new leader might redo already-started migrations
  5. No split-brain detection: Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but cluster doesn't enforce it)

Alignment with Product Vision

Primitives Over Frameworks:

  • Cluster Coordination provides primitives (leader election, shard assignment), not a complete framework
  • Application owns actor migration strategy (via ShardManager PlacementStrategy)
  • Application owns failure response (can custom-implement node monitoring)

NATS-Native:

  • Leader election uses NATS KV for atomic operations
  • Node discovery uses NATS pub/sub for heartbeats
  • Shard topology can be published via NATS events

Event-Sourced:

  • All topology changes produce events (NodeJoined, NodeFailed, ShardAssigned, ShardMigrated)
  • Events enable audit trail and replay (who owns which shard when?)

Resource Conscious:

  • Minimal overhead: consistent hashing avoids per-node state explosion
  • Lease-based election lighter than Raft (no log replication)
  • Virtual nodes (150 per node) keep the hash ring's memory footprint small on modest hardware

References

  • Lease-based election: Inspired by Chubby, Google's lock service
  • Consistent hashing: Karger et al., "Consistent Hashing and Random Trees"
  • Virtual nodes: Reduces reshuffling on topology change (Dynamo, Cassandra pattern)
  • NATS KV: Used for atomicity; alternatives: etcd, Consul (but less NATS-native)