Domain Model: Cluster Coordination

Summary

The Cluster Coordination context manages the distributed topology of actor nodes in an Aether cluster. Its core responsibility is to maintain consistency invariants: exactly one leader per term, all active shards assigned to at least one node, and no orphaned shards. It coordinates node discovery (via NATS heartbeats), leader election (lease-based), shard assignment (via consistent hashing), and rebalancing (when topology changes). The context enforces that only the leader can initiate rebalancing, and that node failures trigger shard reassignment to prevent actor orphaning.

Key insight: Cluster Coordination is not actor placement or routing (that's the application's responsibility via ShardManager). It owns the topology and leadership, enabling routing decisions by publishing shard assignments.


Invariants

These are the business rules that must never be violated:

Invariant 1: Single Leader Per Term

  • Rule: At any point in time, at most one node is the leader for the current leadership term.
  • Scope: LeadershipLease aggregate
  • Why: Multiple leaders (split-brain) lead to conflicting rebalancing decisions and inconsistent shard assignments.
  • Enforcement: LeaderElection enforces via NATS KV atomic operations (create/update with revision). Only one node can atomically claim the "leader" key.
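
A minimal sketch of the atomic claim using the nats.go KV API; the bucket name, key name, and package layout are assumptions, not the project's actual code:

```go
// Sketch: KV Create succeeds only if the key does not already exist,
// so exactly one contender can win the "leader" key per term.
package leaderclaim

import "github.com/nats-io/nats.go"

// tryClaimLeadership returns (true, revision) if this node won the race.
func tryClaimLeadership(js nats.JetStreamContext, nodeID string) (bool, uint64, error) {
	kv, err := js.KeyValue("aether-leadership") // assumed bucket name
	if err != nil {
		return false, 0, err
	}
	rev, err := kv.Create("leader", []byte(nodeID))
	if err != nil {
		// Simplification: treat any Create failure as "another node holds the key".
		return false, 0, nil
	}
	return true, rev, nil
}
```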

Invariant 2: All Active Shards Have Owner(s)

  • Rule: Every shard ID in [0, ShardCount) must be assigned to at least one active node if the cluster is healthy.
  • Scope: ShardAssignment aggregate
  • Why: Unassigned shards mean actors on those shards have no home; messages to them are orphaned.
  • Enforcement: LeaderElection enforces (only leader can assign). ClusterManager validates before applying assignments.

Invariant 3: Assigned Shards Exist on Healthy Nodes Only

  • Rule: A shard assignment to node N is only valid if N is in NodeStatusActive.
  • Scope: ShardAssignment + Cluster aggregates (coupled)
  • Why: Assigning shards to failed nodes means actors can't execute.
  • Enforcement: When node fails (NodeStatusFailed), leader rebalances shards off that node. handleNodeUpdate marks nodes failed after 90s heartbeat miss.

Invariant 4: Shard Assignments Stable During Leadership Lease

  • Rule: Shard assignments only change in response to LeaderElected or NodeFailed; they don't arbitrarily shift during a stable leadership term.
  • Scope: ShardAssignment + LeadershipLease (coupled)
  • Why: Frequent rebalancing causes thrashing and actor migration overhead.
  • Enforcement: rebalanceLoop (every 5 min) only runs if leader; triggerShardRebalancing only called on node changes (NodeJoined/Left/Failed).

Invariant 5: Leader Is an Active Node

  • Rule: If LeaderID is set, the node with that ID must exist in Cluster.nodes with status=Active.
  • Scope: Cluster + LeadershipLease (coupled)
  • Why: A failed leader cannot coordinate cluster decisions.
  • Enforcement: handleNodeUpdate marks nodes failed after timeout; leader renewal fails if node is marked failed. Split-brain risk: partition could allow multiple leaders, but lease expiration + atomic update mitigates.

Aggregates

Aggregate 1: Cluster (Root)

Invariants enforced:

  • Invariant 2: All active shards have owners
  • Invariant 3: Shards assigned only to healthy nodes
  • Invariant 4: Shard assignments stable during leadership lease
  • Invariant 5: Leader is an active node

Entities:

  • Cluster (root): Represents the distributed topology and orchestrates rebalancing
    • nodes: Map[NodeID → NodeInfo] - all known nodes, their status, load, capacity
    • shardMap: ShardMap - current shard-to-node assignments
    • hashRing: ConsistentHashRing - used to compute which node owns which shard
    • currentLeaderID: String - who is leading this term
    • term: uint64 - leadership term counter

Value Objects:

  • NodeInfo: ID, Address, Port, Status, Capacity, Load, LastSeen, Metadata, VMCount, ShardIDs
    • Represents a physical node in the cluster; treated as immutable and replaced via NodeUpdate commands rather than mutated in place
  • ShardMap: Version, Shards (map[ShardID → []NodeID]), Nodes (map[NodeID → NodeInfo]), UpdateTime
    • Snapshot of current shard topology; immutable (replaced, not mutated)
  • NodeStatus: Enum (Active, Draining, Failed)
    • Indicates health state of a node

Lifecycle:

  • Created when: ClusterManager is instantiated (Cluster exists as singleton during runtime)
  • Destroyed when: Cluster shuts down or node is permanently removed
  • Transitions:
    • NodeJoined → add node to nodes, add to hashRing, trigger rebalance (if leader)
    • NodeLeft → remove node from nodes, remove from hashRing, trigger rebalance (if leader)
    • NodeFailed (detected) → mark node as failed, trigger rebalance (if leader)
    • LeaderElected → update currentLeaderID, may trigger rebalance
    • ShardAssigned → update shardMap, increment version

Behavior Methods (not just getters/setters):

  • addNode(nodeInfo) → NodeJoined event + may trigger rebalance
  • removeNode(nodeID) → NodeLeft event + trigger rebalance
  • markNodeFailed(nodeID) → NodeFailed event + trigger rebalance
  • assignShards(shardMap) → ShardAssigned event (leader only)
  • rebalanceTopology() → ShardMigrated events (leader only)

Aggregate 2: LeadershipLease (Root)

Invariants enforced:

  • Invariant 1: Single leader per term
  • Invariant 5: Leader is an active node

Entities:

  • LeadershipLease (root): Represents the current leadership claim
    • leaderID: String - which node holds the lease
    • term: uint64 - monotonically increasing term number
    • expiresAt: Timestamp - when this lease expires (now + LeaderLeaseTimeout)
    • startedAt: Timestamp - when leader was elected

Value Objects:

  • None (all properties immutable; lease is replaced, not mutated)

Lifecycle:

  • Created when: A node wins election and creates the "leader" key in NATS KV
  • Destroyed when: Lease expires and is not renewed, or leader resigns
  • Transitions:
    • TryBecomeLeader → attempt atomic create; if the create fails, attempt to claim the existing lease only if it has expired
    • RenewLease (every 3s) → atomically update expiresAt to now + 10s
    • LeaseExpired (detected) → remove from KV, allow new election
    • NodeFailed (detected) → if failed node is leader, expiration will trigger new election

Behavior Methods:

  • tryAcquire(nodeID) → LeaderElected event (if succeeds)
  • renewLease(nodeID) → LeadershipRenewed event (internal, not exposed as command)
  • isExpired() → Boolean
  • isLeader(nodeID) → Boolean

Invariant enforcement mechanism:

  • Atomic operations in NATS KV: Only one node can successfully create "leader" key (or update with correct revision), ensuring single leader per term.
  • Lease expiration: If leader crashes without renewing, lease expires after 10s, allowing another node to claim it.
  • Revision-based updates: Update to lease must include correct revision (optimistic concurrency control), preventing stale leader from renewing.
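
A minimal sketch of the revision-based renewal path, assuming a JSON lease payload and the nats.go KeyValue API; the payload fields and key name are illustrative:

```go
package leasesketch

import (
	"encoding/json"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

type leasePayload struct {
	LeaderID  string    `json:"leader_id"`
	Term      uint64    `json:"term"`
	ExpiresAt time.Time `json:"expires_at"`
}

// renewLease extends the lease only if our last-known revision is still
// current; otherwise another node has taken over and we must step down.
func renewLease(kv nats.KeyValue, nodeID string, term, lastRev uint64, ttl time.Duration) (uint64, error) {
	payload, err := json.Marshal(leasePayload{
		LeaderID:  nodeID,
		Term:      term,
		ExpiresAt: time.Now().Add(ttl),
	})
	if err != nil {
		return 0, err
	}
	newRev, err := kv.Update("leader", payload, lastRev)
	if err != nil {
		return 0, fmt.Errorf("lost leadership: %w", err)
	}
	return newRev, nil
}
```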

Aggregate 3: ShardAssignment (Root)

Invariants enforced:

  • Invariant 2: All active shards have owners
  • Invariant 3: Shards assigned only to healthy nodes

Entities:

  • ShardAssignment (root): Maps shards to their owning nodes
    • version: uint64 - incremented on each change, enables version comparison for replication
    • assignments: Map[ShardID → []NodeID] - shard to primary+replica nodes
    • nodes: Map[NodeID → NodeInfo] - snapshot of active nodes at assignment time
    • updateTime: Timestamp

Value Objects:

  • None (structure is just data; immutability via replacement)

Lifecycle:

  • Created when: Cluster initializes (empty assignments)
  • Updated when: Leader calls rebalanceTopology() → new ShardAssignment created (old one replaced)
  • Destroyed when: Cluster shuts down

Behavior Methods:

  • assignShard(shardID, nodeList) → validates all nodes in nodeList are active
  • rebalanceFromTopology(topology, strategy) → calls strategy to compute new assignments
  • validateAssignments() → checks all shards assigned, all owners healthy
  • getAssignmentsForNode(nodeID) → []ShardID

Validation Rules:

  • All nodes in assignment must be in nodes map with status=Active
  • All shard IDs in [0, ShardCount) must appear in assignments (no orphans)
  • Replication factor respected (each shard has 1..ReplicationFactor owners)
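
A sketch of these checks with assumed types; the real aggregate would run something equivalent before accepting a new assignment map:

```go
package shardsketch

import "fmt"

type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusDraining
	NodeStatusFailed
)

type NodeInfo struct {
	ID     string
	Status NodeStatus
}

// validateAssignments checks shard coverage (Invariant 2), healthy owners
// (Invariant 3), and the replication factor before a map is accepted.
func validateAssignments(assignments map[int][]string, nodes map[string]*NodeInfo, shardCount, replicationFactor int) error {
	for shardID := 0; shardID < shardCount; shardID++ {
		owners, ok := assignments[shardID]
		if !ok || len(owners) == 0 {
			return fmt.Errorf("shard %d has no owner", shardID)
		}
		if len(owners) > replicationFactor {
			return fmt.Errorf("shard %d exceeds replication factor", shardID)
		}
		for _, nodeID := range owners {
			node, found := nodes[nodeID]
			if !found || node.Status != NodeStatusActive {
				return fmt.Errorf("shard %d assigned to unhealthy node %s", shardID, nodeID)
			}
		}
	}
	return nil
}
```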

Commands

Commands represent user or system intents to change the cluster state. Only aggregates handle commands.

Command 1: JoinCluster

  • Aggregate: Cluster
  • Actor: Node joining (or discovery service announcing)
  • Input: nodeID, address, port, capacity, metadata
  • Validates:
    • nodeID is not empty
    • capacity > 0
    • address is reachable (optional)
  • Invariants enforced: Invariant 2 (rebalance if needed)
  • Success: NodeJoined event published
  • Failure: DuplicateNodeError (node already in cluster), ValidationError

Command 2: ElectLeader

  • Aggregate: LeadershipLease
  • Actor: Node attempting election (triggered periodically)
  • Input: nodeID, currentTerm
  • Validates:
    • nodeID matches a current cluster member (in Cluster.nodes, status=Active)
    • Can attempt if no current leader OR lease is expired
  • Invariants enforced: Invariant 1, 5
  • Success: LeaderElected event published (if atomic create succeeds); LeadershipRenewed (if claim expired lease)
  • Failure: LeaderElectionFailed (lost the atomic-create race), NodeNotHealthy

Command 3: RenewLeadership

  • Aggregate: LeadershipLease
  • Actor: Current leader (triggered every 3s)
  • Input: nodeID, currentTerm
  • Validates:
    • nodeID is current leader
    • term matches current term
    • node status is Active (else fail and lose leadership)
  • Invariants enforced: Invariant 1, 5
  • Success: LeadershipRenewed (internal event, triggers heartbeat log entry)
  • Failure: LeadershipLost (node is no longer healthy or lost atomic update race)

Command 4: MarkNodeFailed

  • Aggregate: Cluster
  • Actor: System (monitoring service) or leader (if heartbeat misses)
  • Input: nodeID, reason
  • Validates:
    • nodeID exists in cluster
    • node is currently Active (don't re-fail already-failed nodes)
  • Invariants enforced: Invariant 2, 3, 5 (rebalance to move shards off failed node)
  • Success: NodeFailed event published; RebalanceTriggered (if leader)
  • Failure: NodeNotFound, NodeAlreadyFailed

Command 5: AssignShards

  • Aggregate: ShardAssignment (+ reads Cluster topology)
  • Actor: Leader (only leader can assign)
  • Input: nodeID (must be leader), newAssignments (Map[ShardID → []NodeID])
  • Validates:
    • nodeID is current leader
    • all nodes in assignments are Active
    • all shards in [0, ShardCount) are covered
    • replication factor respected
  • Invariants enforced: Invariant 2, 3 (assignment only valid if all nodes healthy)
  • Success: ShardAssigned event published with new ShardMap
  • Failure: NotLeader, InvalidAssignment (node not found), UnhealthyNode, IncompleteAssignment (missing shards)

Command 6: RebalanceShards

  • Aggregate: Cluster (orchestrates) + ShardAssignment (executes)
  • Actor: Leader (triggered by node changes or periodic check)
  • Input: nodeID (must be leader), strategy (optional placement strategy)
  • Validates:
    • nodeID is current leader
    • cluster has active nodes
  • Invariants enforced: Invariant 2 (all shards still assigned), Invariant 3 (only to healthy nodes)
  • Success: RebalancingCompleted event; zero or more ShardMigrated events (one per shard moved)
  • Failure: NotLeader, NoActiveNodes, RebalancingFailed (unexpected topology change mid-rebalance)

Events

Events represent facts that happened. They are published after successful command execution.

Event 1: NodeJoined

  • Triggered by: JoinCluster command
  • Aggregate: Cluster
  • Data: nodeID, address, port, capacity, metadata, timestamp
  • Consumed by:
    • Cluster (adds node to ring)
    • Policies (RebalancingTriggerPolicy)
  • Semantics: A new node entered the cluster and is ready to host actors
  • Immutability: Once published, never changes

Event 2: NodeDiscovered

  • Triggered by: NodeDiscovery announces node via NATS pub/sub (implicit)
  • Aggregate: Cluster (discovery feeds into cluster topology)
  • Data: nodeID, nodeInfo, timestamp
  • Consumed by: Cluster topology sync
  • Semantics: Node became visible to the cluster; may be new or rediscovered after network partition
  • Note: Implicit event; not explicitly commanded, but captured in domain language

Event 3: LeaderElected

  • Triggered by: ElectLeader command (atomic KV create succeeds) or ReclaimExpiredLease
  • Aggregate: LeadershipLease
  • Data: leaderID, term, expiresAt, startedAt, timestamp
  • Consumed by:
    • Cluster (updates currentLeaderID)
    • Policies (LeaderElectionCompletePolicy)
  • Semantics: A node has acquired leadership for the given term
  • Guarantee: At most one node can succeed in creating this event per term

Event 4: LeadershipLost

  • Triggered by: Lease expires (detected by monitorLeadership watcher) or RenewLeadership fails
  • Aggregate: LeadershipLease
  • Data: leaderID, term, reason (LeaseExpired, FailedToRenew, NodeFailed), timestamp
  • Consumed by:
    • Cluster (clears currentLeaderID)
    • Policies (trigger new election)
  • Semantics: The leader is no longer valid and coordination authority is vacant
  • Trigger: No renewal received for 10s, or atomic update fails

Event 5: LeadershipRenewed

  • Triggered by: RenewLeadership command (succeeds every 3s)
  • Aggregate: LeadershipLease
  • Data: leaderID, term, expiresAt, timestamp
  • Consumed by: Internal use (heartbeat signal); not published to other contexts
  • Semantics: Leader is alive and ready to coordinate
  • Frequency: Every 3s per leader

Event 6: ShardAssigned

  • Triggered by: AssignShards command or RebalanceShards command
  • Aggregate: ShardAssignment
  • Data: shardID, nodeIDs (primary + replicas), version, timestamp
  • Consumed by:
    • ShardManager (updates routing)
    • Policies (ShardOwnershipPolicy)
    • Other contexts (if they subscribe to shard topology changes)
  • Semantics: Shard N is now owned by these nodes (primary first)
  • Bulk event: Often published multiple times in one rebalance operation
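
A hypothetical shape for this event and how it might be published over NATS; the subject name and field names are assumptions, not the real wire format:

```go
package eventsketch

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

// ShardAssigned is a hypothetical event payload; the real format may differ.
type ShardAssigned struct {
	ShardID   int       `json:"shard_id"`
	NodeIDs   []string  `json:"node_ids"` // primary first, then replicas
	Version   uint64    `json:"version"`
	Timestamp time.Time `json:"timestamp"`
}

func publishShardAssigned(nc *nats.Conn, ev ShardAssigned) error {
	data, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	// Assumed subject; ShardManager and other consumers would subscribe to it.
	return nc.Publish("aether.cluster.shard.assigned", data)
}
```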

Event 7: NodeFailed

  • Triggered by: MarkNodeFailed command
  • Aggregate: Cluster
  • Data: nodeID, reason (HeartbeatTimeout, AdminMarked, etc.), timestamp
  • Consumed by:
    • Cluster (removes from active pool)
    • Policies (RebalancingTriggerPolicy, actor migration)
    • Other contexts (may need to relocate actors)
  • Semantics: Node is unresponsive and should be treated as offline
  • Detection: heartbeat miss after 90s, or explicit admin action

Event 8: NodeLeft

  • Triggered by: Node gracefully shuts down (announceNode(NodeLeft)) or MarkNodeFailed (for draining)
  • Aggregate: Cluster
  • Data: nodeID, reason (GracefulShutdown, AdminRemoved, etc.), timestamp
  • Consumed by: Policies (same as NodeFailed, triggers rebalance)
  • Semantics: Node is intentionally leaving and will not rejoin
  • Difference from NodeFailed: Intent signal; failed nodes might rejoin after network partition heals

Event 9: ShardMigrated

  • Triggered by: RebalanceShards command (one event per shard reassigned)
  • Aggregate: Cluster
  • Data: shardID, fromNodes (old owners), toNodes (new owners), timestamp
  • Consumed by:
    • Local runtime (via ShardManager; triggers actor migration)
    • Other contexts (if they track actor locations)
  • Semantics: A shard's ownership changed; actors on that shard may need to migrate
  • Migration strategy: Application owns how to move actors (via ActorMigration); cluster just signals the change

Event 10: RebalancingTriggered

  • Triggered by: RebalanceShards command (start)
  • Aggregate: Cluster
  • Data: leaderID, reason (NodeJoined, NodeFailed, Manual), timestamp
  • Consumed by: Monitoring/debugging
  • Semantics: Leader has initiated a rebalancing cycle
  • Note: Informational; subsequent ShardMigrated events describe the actual changes

Event 11: RebalancingCompleted

  • Triggered by: RebalanceShards command (finish)
  • Aggregate: Cluster
  • Data: leaderID, completedAt, migrationsCount, timestamp
  • Consumed by: Monitoring/debugging, other contexts may wait for this before proceeding
  • Semantics: All shard reassignments have been published; it does not mean actor migrations have finished
  • Note: ShardMigrated is the signal to move actors; this is the coordination signal

Policies

Policies are automated reactions to events. They connect events to commands across aggregates and contexts.

Policy 1: Single Leader Policy

  • Trigger: When LeadershipLost event
  • Action: Any node can attempt ElectLeader command
  • Context: Only one will succeed due to atomic NATS KV operation
  • Rationale: Ensure leadership is re-established quickly after vacancy
  • Implementation: electionLoop in LeaderElection runs every 2s, calls tryBecomeLeader if not leader

Policy 2: Lease Renewal Policy

  • Trigger: Periodic timer (every 3s)
  • Action: If IsLeader, execute RenewLeadership command
  • Context: Heartbeat mechanism to prove leader is alive
  • Rationale: Detect leader failure via lease expiration after 10s inactivity
  • Implementation: leaseRenewalLoop in LeaderElection; failure triggers loseLeadership()
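
A sketch of the loop's shape, using stand-in function fields rather than the real LeaderElection type:

```go
package renewsketch

import (
	"context"
	"time"
)

// leaderElection holds stand-in hooks for the real type's IsLeader,
// renewLease, and loseLeadership behavior.
type leaderElection struct {
	isLeader       func() bool
	renewLease     func() error
	loseLeadership func()
}

// leaseRenewalLoop ticks every 3s; a failed renewal means we step down and
// let the election policy pick a new leader.
func (le *leaderElection) leaseRenewalLoop(ctx context.Context) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !le.isLeader() {
				continue
			}
			if err := le.renewLease(); err != nil {
				le.loseLeadership()
			}
		}
	}
}
```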

Policy 3: Lease Expiration Policy

  • Trigger: When LeadershipLease.expiresAt < now (detected by monitorLeadership watcher)
  • Action: Clear currentLeader, publish LeadershipLost, trigger SingleLeaderPolicy
  • Context: Automatic failover when leader stops renewing
  • Rationale: Prevent stale leaders from coordinating during network partitions
  • Implementation: monitorLeadership watches "leader" KV key; if deleted or expired, calls handleLeadershipUpdate

Policy 4: Node Heartbeat Policy

  • Trigger: Periodic timer (every 30s) - NodeDiscovery announces
  • Action: Publish node status via NATS "aether.discovery" subject
  • Context: Membership discovery; all nodes broadcast presence
  • Rationale: Other nodes learn topology via heartbeats; leader detects failures via absence
  • Implementation: NodeDiscovery.Start() runs heartbeat ticker

Policy 5: Node Failure Detection Policy

  • Trigger: When NodeUpdate received with LastSeen > 90s ago
  • Action: Mark node as NodeStatusFailed; if leader, trigger RebalanceShards
  • Context: Eventual failure detection (passive, via heartbeat miss)
  • Rationale: Failed nodes may still hold shard assignments; rebalance moves shards to healthy nodes
  • Implementation: handleNodeUpdate checks LastSeen and marks nodes failed; checkNodeHealth periodic check
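
A minimal sketch of the detection pass under assumed types; the real handleNodeUpdate/checkNodeHealth differ in detail:

```go
package detectsketch

import "time"

type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusFailed
)

type NodeInfo struct {
	ID       string
	Status   NodeStatus
	LastSeen time.Time
}

const heartbeatTimeout = 90 * time.Second // three missed 30s heartbeats

// checkNodeHealth marks stale nodes failed; only the leader reacts by rebalancing.
func checkNodeHealth(nodes map[string]*NodeInfo, isLeader bool, markFailed func(nodeID string), triggerRebalance func(reason string)) {
	now := time.Now()
	for id, node := range nodes {
		if node.Status == NodeStatusActive && now.Sub(node.LastSeen) > heartbeatTimeout {
			markFailed(id)
			if isLeader {
				triggerRebalance("node failed: " + id)
			}
		}
	}
}
```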

Policy 6: Shard Rebalancing Trigger Policy

  • Trigger: When NodeJoined, NodeLeft, or NodeFailed event
  • Action: If leader, execute RebalanceShards command
  • Context: Topology change → redistribute actors
  • Rationale: New node should get load; failed node's shards must be reassigned
  • Implementation: handleNodeUpdate calls triggerShardRebalancing if leader

Policy 7: Shard Ownership Enforcement Policy

  • Trigger: When ShardAssigned event
  • Action: Update local ShardMap; nodes use this for actor routing
  • Context: All nodes must agree on shard ownership for routing consistency
  • Rationale: Single source of truth (published by leader) prevents routing conflicts
  • Implementation: ClusterManager receives ShardAssigned via NATS; updates shardMap

Policy 8: Shard Coverage Policy

  • Trigger: Periodic check (every 5 min) or after NodeFailed
  • Action: Validate all shards in [0, ShardCount) are assigned; if any missing, trigger RebalanceShards
  • Context: Safety check to prevent shard orphaning
  • Rationale: Ensure no actor can be born on an unassigned shard
  • Implementation: rebalanceLoop calls triggerShardRebalancing with reason "periodic rebalance check"

Policy 9: Leader-Only Rebalancing Policy

  • Trigger: RebalanceShards command
  • Action: Validate nodeID is currentLeader before executing
  • Context: Only leader can initiate topology changes
  • Rationale: Prevent cascading rebalancing from multiple nodes; single coordinator
  • Implementation: triggerShardRebalancing checks IsLeader() at start

Policy 10: Graceful Shutdown Policy

  • Trigger: NodeDiscovery.Stop() called
  • Action: Publish NodeLeft event
  • Context: Signal that this node is intentionally leaving
  • Rationale: Other nodes should rebalance shards away from this node; different from failure
  • Implementation: Stop() calls announceNode(NodeLeft) before shutting down

Read Models

Read models project state for queries. They have no invariants and can be eventually consistent.

Read Model 1: GetClusterTopology

  • Purpose: What nodes are currently in the cluster?
  • Data:
    • nodes: []NodeInfo (filtered to status=Active only)
    • timestamp: When snapshot was taken
  • Source: Cluster.nodes, filtered by status != Failed
  • Updated: After NodeJoined, NodeLeft, NodeFailed events
  • Queryable by: nodeID, status, capacity, load
  • Eventual consistency: Replica nodes lag leader by a few heartbeats

Read Model 2: GetLeader

  • Purpose: Who is the current leader?
  • Data:
    • leaderID: Current leader node ID, or null if no leader
    • term: Leadership term number
    • expiresAt: When current leadership lease expires
    • confidence: "high" (just renewed), "medium" (recent), "low" (about to expire)
  • Source: LeadershipLease
  • Updated: After LeaderElected, LeadershipRenewed, LeadershipLost events
  • Queryable by: leaderID, term, expiration time
  • Eventual consistency: Non-leader nodes lag by up to 10s (lease timeout)

Read Model 3: GetShardAssignments

  • Purpose: Where does each shard live?
  • Data:
    • shardID: Shard number
    • primaryNode: Node ID (shardMap.Shards[shardID][0])
    • replicaNodes: []NodeID (shardMap.Shards[shardID][1:])
    • version: ShardMap version (for optimistic concurrency)
  • Source: Cluster.shardMap
  • Updated: After ShardAssigned, ShardMigrated events
  • Queryable by: shardID, nodeID (which shards does node own?)
  • Eventual consistency: Replicas lag leader by one NATS publish; consistent within a term
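
For illustration, a sketch of the "which shards does this node own?" query against an assumed Shards map (shard ID → owner node IDs, primary first):

```go
package readsketch

// shardsOwnedBy projects the shard map into a per-node view; it is a pure
// read model with no invariants of its own.
func shardsOwnedBy(shards map[int][]string, nodeID string) []int {
	var owned []int
	for shardID, owners := range shards {
		for _, id := range owners {
			if id == nodeID {
				owned = append(owned, shardID)
				break
			}
		}
	}
	return owned
}
```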

Read Model 4: GetNodeHealth

  • Purpose: Is a given node healthy?
  • Data:
    • nodeID: Node identifier
    • status: Active | Draining | Failed
    • lastSeen: Last heartbeat timestamp
    • downForSeconds: (now - lastSeen)
  • Source: Cluster.nodes[nodeID]
  • Updated: After NodeJoined, NodeUpdated, NodeFailed events
  • Queryable by: nodeID, status threshold (e.g., "give me all failed nodes")
  • Eventual consistency: Non-leader nodes lag by 30s (heartbeat interval)

Read Model 5: GetRebalancingStatus

  • Purpose: Is rebalancing in progress? How many shards moved?
  • Data:
    • isRebalancing: Boolean
    • startedAt: Timestamp
    • reason: "node_joined" | "node_failed" | "periodic" | "manual"
    • completedCount: Number of shards finished
    • totalCount: Total shards to move
  • Source: RebalancingTriggered, ShardMigrated, RebalancingCompleted events
  • Updated: On rebalancing events
  • Queryable by: current status, started within N seconds
  • Eventual consistency: Replicas lag by one NATS publish

Value Objects

Value Object 1: NodeInfo

Represents a physical node in the cluster.

Fields:

  • ID: string - unique identifier
  • Address: string - IP or hostname
  • Port: int - NATS port
  • Status: NodeStatus enum (Active, Draining, Failed)
  • Capacity: float64 - max load capacity
  • Load: float64 - current load
  • LastSeen: time.Time - last heartbeat
  • Timestamp: time.Time - when created/updated
  • Metadata: map[string]string - arbitrary tags (region, version, etc.)
  • IsLeader: bool - is this the leader?
  • VMCount: int - how many actors on this node
  • ShardIDs: []int - which shards are assigned

Equality: Two NodeInfos are equal if all fields match (value equality; for clustering purposes nodes are keyed by ID, and instances are treated as immutable)

Validation:

  • ID non-empty
  • Capacity > 0
  • Status in {Active, Draining, Failed}
  • Port in valid range [1, 65535]
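
These rules might look like the following sketch; the field names are assumed from the description above and the types are stand-ins:

```go
package nodesketch

import (
	"errors"
	"fmt"
)

type NodeStatus int

const (
	NodeStatusActive NodeStatus = iota
	NodeStatusDraining
	NodeStatusFailed
)

type NodeInfo struct {
	ID       string
	Port     int
	Status   NodeStatus
	Capacity float64
}

// validate applies the NodeInfo rules: non-empty ID, positive capacity,
// valid port range, and a known status.
func (n NodeInfo) validate() error {
	if n.ID == "" {
		return errors.New("node ID must be non-empty")
	}
	if n.Capacity <= 0 {
		return errors.New("capacity must be positive")
	}
	if n.Port < 1 || n.Port > 65535 {
		return fmt.Errorf("port %d out of range", n.Port)
	}
	switch n.Status {
	case NodeStatusActive, NodeStatusDraining, NodeStatusFailed:
		return nil
	default:
		return fmt.Errorf("unknown status %v", n.Status)
	}
}
```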

Value Object 2: ShardMap

Represents the current shard-to-node assignment snapshot.

Fields:

  • Version: uint64 - incremented on each change; used for optimistic concurrency
  • Shards: Map[ShardID → []NodeID] - shard to [primary, replica1, replica2, ...]
  • Nodes: Map[NodeID → NodeInfo] - snapshot of nodes known at assignment time
  • UpdateTime: time.Time - when created

Equality: Two ShardMaps are equal if Version and Shards are equal (Nodes is metadata)

Validation:

  • All shard IDs in [0, ShardCount)
  • All node IDs in Shards exist in Nodes
  • All nodes in Nodes have status=Active
  • Replication factor respected (1 ≤ len(Shards[sid]) ≤ ReplicationFactor)

Immutability: ShardMap is never mutated; rebalancing creates a new ShardMap


Value Object 3: LeadershipLease

Represents a leader's claim on coordination authority.

Fields:

  • LeaderID: string - node ID holding the lease
  • Term: uint64 - monotonically increasing term number
  • ExpiresAt: time.Time - when lease is no longer valid
  • StartedAt: time.Time - when leader was elected

Equality: Two leases are equal if LeaderID, Term, and ExpiresAt match

Validation:

  • LeaderID non-empty
  • Term ≥ 0
  • ExpiresAt > StartedAt
  • ExpiresAt - StartedAt == LeaderLeaseTimeout

Lifecycle:

  • Created: node wins election
  • Renewed: every 3s, ExpiresAt extended
  • Expired: if ExpiresAt < now and not renewed
  • Replaced: next term when new leader elected

Value Object 4: Term

Represents a leadership term (could be extracted for clarity).

Fields:

  • Number: uint64 - term counter

Semantics: Monotonically increasing; each new leader gets a higher term. Used to detect stale messages.


Code Analysis

Intended vs Actual: ClusterManager

Intended (from Domain Model):

  • Root aggregate owning Cluster topology
  • Enforces invariants: shard coverage, healthy node assignments, rebalancing triggers
  • Commands: JoinCluster, MarkNodeFailed, RebalanceShards
  • Events: NodeJoined, NodeFailed, ShardAssigned, ShardMigrated

Actual (from /cluster/manager.go):

  • Partially aggregate-like: owns nodes, shardMap, hashRing
  • Lacks explicit command methods: has handleClusterMessage() but not named commands like JoinCluster()
  • Lacks explicit event publishing: updates state but doesn't publish domain events
  • Invariant enforcement scattered: node failure detection in handleNodeUpdate(), but no central validation
  • Missing behavior: shard assignment logic in ShardManager, not in Cluster aggregate

Misalignment:

  1. Anemic aggregate: ClusterManager reads/writes state but doesn't enforce invariants or publish events
  2. Responsibility split: Cluster topology (Manager) vs shard assignment (ShardManager) vs leadership (LeaderElection) are not unified under one aggregate root
  3. No explicit commands: Node updates handled via generic message dispatcher, not domain-language commands
  4. No event sourcing: State changes don't produce events

Gaps:

  • No JoinCluster command handler
  • No MarkNodeFailed command handler (only handleNodeUpdate which detects failures)
  • No explicit ShardAssigned/ShardMigrated events
  • Rebalancing triggers exist (triggerShardRebalancing) but not as domain commands

Intended vs Actual: LeaderElection

Intended (from Domain Model):

  • Root aggregate owning LeadershipLease invariant (single leader per term)
  • Commands: ElectLeader, RenewLeadership
  • Events: LeaderElected, LeadershipLost, LeadershipRenewed

Actual (from /cluster/leader.go):

  • Correctly implements lease-based election with NATS KV
  • Enforces single leader via atomic operations (create, update with revision)
  • Has implicit command pattern (tryBecomeLeader, renewLease, resignLeadership)
  • Has callbacks for leadership change, but no explicit event publishing

Alignment:

  • Atomic operations correctly enforce Invariant 1 (single leader)
  • Lease renewal every 3s enforces lease validity
  • Lease expiration detected via watcher
  • Leadership transitions (elected, lost) well-modeled

Gaps:

  • Events not explicitly published; callbacks used instead
  • No event sourcing (events should be recorded in event store, not just callbacks)
  • No term-based validation (could reject stale messages with old term)
  • Could be more explicit about LeaderElected event vs just callback

Intended vs Actual: ConsistentHashRing

Intended (from Domain Model):

  • Used by ShardAssignment to compute which node owns which shard
  • Policy: shards assigned via consistent hashing
  • Minimizes reshuffling on node join/leave

Actual (from /cluster/hashring.go):

  • Correctly implements consistent hash ring with virtual nodes
  • AddNode/RemoveNode operations are clean
  • GetNode(key) returns responsible node; used for actor placement

Alignment:

  • Good separation of concerns (ring is utility, not aggregate)
  • Virtual nodes (150 per node) reduce reshuffling on node change
  • Immutable ring structure (recreated on changes)

Gaps:

  • Not actively used by ShardAssignment (ShardManager has its own hash logic)
  • Could be used by RebalanceShards policy to compute initial assignments
  • Currently more of a utility than a policy
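
For reference, a compact ring with virtual nodes; the hash function and structure are illustrative, not the project's implementation, though the 150-replica count mirrors the description above:

```go
package ringsketch

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 150

type Ring struct {
	hashes []uint32          // sorted virtual-node hashes
	owner  map[uint32]string // virtual-node hash -> physical node ID
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodeIDs []string) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, id := range nodeIDs {
		for v := 0; v < virtualNodes; v++ {
			h := hashKey(fmt.Sprintf("%s#%d", id, v))
			r.hashes = append(r.hashes, h)
			r.owner[h] = id
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// GetNode returns the node responsible for a key: the first virtual node
// clockwise from the key's hash.
func (r *Ring) GetNode(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owner[r.hashes[i]]
}
```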

Intended vs Actual: ShardManager

Intended (from Domain Model):

  • ShardAssignment aggregate managing shard-to-node mappings
  • Commands: AssignShard, RebalanceShards (via PlacementStrategy)
  • Enforces invariants: all shards assigned, only to healthy nodes
  • Emits ShardAssigned events

Actual (from /cluster/shard.go):

  • Owns ShardMap, but like ClusterManager, is more of a data holder than aggregate
  • Has methods: AssignShard, RebalanceShards (delegates to PlacementStrategy)
  • Lacks invariant validation (doesn't check if nodes are healthy)
  • Lacks event publishing

Alignment:

  • PlacementStrategy pattern allows different algorithms (good design)
  • ConsistentHashPlacement exists but is stubbed

Gaps:

  • ShardManager.RebalanceShards not integrated with ClusterManager's decision to rebalance
  • No event publishing on shard changes
  • Invariant validation needed: validate nodes in assignments are healthy

Intended vs Actual: NodeDiscovery

Intended (from Domain Model):

  • Detects nodes via NATS heartbeats
  • Publishes NodeJoined, NodeUpdated, NodeLeft events via announceNode
  • Triggers policies (node failure detection, rebalancing)

Actual (from /cluster/discovery.go):

  • Heartbeats every 30s via announceNode
  • Subscribes to "aether.discovery" channel
  • Publishes NodeUpdate messages, not domain events

Alignment:

  • The heartbeat mechanism is sound; failures are detected via the 90s timeout in ClusterManager
  • Message-based communication works as an event bus

Gaps:

  • NodeUpdate is not a domain event; should publish NodeJoined, NodeUpdated, NodeLeft as explicit events
  • Could be clearer about lifecycle: Start announces NodeJoined, Stop announces NodeLeft

Intended vs Actual: DistributedVM

Intended (from Domain Model):

  • Orchestrates all cluster components (discovery, election, coordination, sharding)
  • Not itself an aggregate; more of a façade/orchestrator

Actual (from /cluster/distributed.go):

  • Correctly orchestrates: discovery + cluster manager + sharding + local runtime
  • DistributedVMRegistry provides VMRegistry interface to ClusterManager
  • Good separation: doesn't force topology decisions on runtime

Alignment:

  • Architecture clean; each component has clear responsibility
  • Decoupling via interfaces (Runtime, VirtualMachine, VMProvider) is good

Gaps:

  • No explicit orchestration logic (the Start method appears incomplete; only the first 100 lines were reviewed)
  • Could coordinate startup order more explicitly

Refactoring Backlog

Refactoring 1: Extract Cluster Aggregate from ClusterManager

Current: ClusterManager is anemic; it only stores state
Target: ClusterManager becomes a true aggregate root enforcing invariants

Steps:

  1. Add explicit command methods to ClusterManager:
    • JoinCluster(nodeInfo NodeInfo) error
    • MarkNodeFailed(nodeID string) error
    • AssignShards(shardMap ShardMap) error
    • RebalanceTopology(reason string) error
  2. Each command:
    • Validates preconditions
    • Calls aggregate behavior (private methods)
    • Publishes events
    • Returns result
  3. Add event publishing:
    • Create EventPublisher interface in ClusterManager
    • Publish NodeJoined, NodeFailed, ShardAssigned, ShardMigrated events
    • Events captured in event store (optional, or via NATS pub/sub)

Impact: Medium - changes ClusterManager interface but not external APIs yet
Priority: High - unblocks event-driven integration with other contexts
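
A sketch of what steps 1-3 could look like for JoinCluster; the internal fields (mu, nodes, hashRing, events) and the EventPublisher interface are assumptions about the refactored shape, not the current code:

```go
package clustersketch

import (
	"fmt"
	"sync"
	"time"
)

// EventPublisher is hypothetical; the real project would publish via NATS or an event store.
type EventPublisher interface {
	Publish(subject string, event any) error
}

type NodeInfo struct {
	ID       string
	Address  string
	Capacity float64
}

type NodeJoined struct {
	NodeID    string
	Address   string
	Capacity  float64
	Timestamp time.Time
}

// ClusterManager here is a stand-in with only the fields this sketch needs.
type ClusterManager struct {
	mu       sync.Mutex
	nodes    map[string]*NodeInfo
	hashRing interface{ AddNode(id string) }
	events   EventPublisher
}

// JoinCluster is the explicit command: validate, apply, publish.
func (cm *ClusterManager) JoinCluster(node NodeInfo) error {
	cm.mu.Lock()
	defer cm.mu.Unlock()

	if node.ID == "" || node.Capacity <= 0 {
		return fmt.Errorf("invalid node info")
	}
	if _, exists := cm.nodes[node.ID]; exists {
		return fmt.Errorf("duplicate node %s", node.ID)
	}

	cm.nodes[node.ID] = &node
	cm.hashRing.AddNode(node.ID)

	// Rebalancing is triggered separately by policy if this node is the leader.
	return cm.events.Publish("aether.cluster.node.joined", NodeJoined{
		NodeID:    node.ID,
		Address:   node.Address,
		Capacity:  node.Capacity,
		Timestamp: time.Now(),
	})
}
```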


Refactoring 2: Extract ShardAssignment Commands from RebalanceShards

Current: ShardManager.RebalanceShards delegates to PlacementStrategy; no validation of healthy nodes
Target: ShardAssignment commands validate invariants

Steps:

  1. Add to ShardManager:
    • AssignShards(assignments map[int][]string, nodes map[string]*NodeInfo) error
      • Validates: all nodes exist and are Active
      • Validates: all shards in [0, ShardCount) assigned
      • Validates: replication factor respected
    • ValidateAssignments() error
  2. Move shard validation from coordinator to ShardManager
  3. Publish ShardAssigned events on successful assignment
  4. Update ClusterManager to call ShardManager.AssignShards instead of directly mutating ShardMap

Impact: Medium - clarifies shard aggregate, adds validation
Priority: High - prevents invalid shard assignments


Refactoring 3: Publish Domain Events from LeaderElection

Current: LeaderElection uses callbacks; no event sourcing
Target: Explicit event publishing for leader changes

Steps:

  1. Add EventPublisher interface to LeaderElection
  2. In becomeLeader: publish LeaderElected event
  3. In loseLeadership: publish LeadershipLost event
  4. Optional: publish LeadershipRenewed on each renewal (for audit trail)
  5. Events include: leaderID, term, expiresAt, timestamp
  6. Consumers subscribe via NATS and react (no longer callbacks)

Impact: Medium - changes LeaderElection interface
Priority: Medium - improves observability and enables event sourcing


Refactoring 4: Unify Node Failure Detection and Rebalancing

Current: Node failure is detected in handleNodeUpdate (90s timeout) plus periodic checkNodeHealth; rebalancing triggers are spread across multiple methods
Target: Explicit MarkNodeFailed command, single rebalancing trigger

Steps:

  1. Create explicit MarkNodeFailed command handler
  2. Move node failure detection logic to ClusterManager.markNodeFailed()
  3. Consolidate node failure checks (remove duplicate in checkNodeHealth)
  4. Trigger rebalancing only from MarkNodeFailed, not scattered
  5. Add RebalancingTriggered event before starting rebalance

Impact: Low - refactoring existing logic, not new behavior
Priority: Medium - improves clarity


Refactoring 5: Implement PlacementStrategy for Rebalancing

Current: ConsistentHashPlacement.RebalanceShards is stubbed
Target: Real rebalancing logic using consistent hashing

Steps:

  1. Implement ConsistentHashPlacement.RebalanceShards:
    • Input: current ShardMap, updated nodes (may have added/removed)
    • Output: new ShardMap with shards redistributed via consistent hash
    • Minimize movement: use virtual nodes to keep most shards in place
  2. Add RebalancingStrategy interface if other strategies needed (e.g., load-aware)
  3. Test: verify adding/removing node only reshuffles ~1/N shards

Impact: Medium - core rebalancing logic, affects all topology changes
Priority: High - currently rebalancing doesn't actually redistribute shards
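
A sketch of how step 1 could compute assignments from a consistent-hash router; ShardRouter and the naive replica selection are assumptions for illustration:

```go
package placementsketch

import "fmt"

// ShardRouter abstracts the consistent-hash ring (see the earlier ring sketch).
type ShardRouter interface {
	GetNode(key string) string
}

// rebalanceShards recomputes the full shard map deterministically; only shards
// whose nearest virtual node changed will actually move.
func rebalanceShards(ring ShardRouter, shardCount, replicationFactor int, activeNodes []string) map[int][]string {
	assignments := make(map[int][]string, shardCount)
	for shardID := 0; shardID < shardCount; shardID++ {
		primary := ring.GetNode(fmt.Sprintf("shard-%d", shardID))
		owners := []string{primary}
		// Naive replica choice: fill from the active-node list, skipping the primary.
		for _, n := range activeNodes {
			if len(owners) >= replicationFactor {
				break
			}
			if n != primary {
				owners = append(owners, n)
			}
		}
		assignments[shardID] = owners
	}
	return assignments
}
```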


Refactoring 6: Add Node Health Check Endpoint

Current: No way to query node health directly
Target: Read model for GetNodeHealth

Steps:

  1. Add method to ClusterManager: GetNodeHealth(nodeID string) NodeHealthStatus
  2. Return: status, lastSeen, downForSeconds
  3. Expose via NATS request/reply (if distributed query needed)
  4. Test: verify timeout logic

Impact: Low - new query method, no state changes
Priority: Low - nice to have for monitoring


Refactoring 7: Add Shard Migration Tracking

Current: ShardMigrated event published, but no tracking of migration progress
Target: ActorMigration status tracking and completion callback

Steps:

  1. Add MigrationTracker in cluster package
  2. On ShardMigrated event: create migration record (pending)
  3. Application reports migration progress (in_progress, completed, failed)
  4. On completion: remove from tracker
  5. Rebalancing can wait for migrations to complete before declaring rebalance done

Impact: High - affects how rebalancing coordinates with the application
Priority: Medium - improves robustness (don't rebalance while migrations are in flight)


Testing Strategy

Unit Tests

LeaderElection invariant tests:

  • Only one node can successfully create "leader" key → test atomic create succeeds once, fails second time
  • Lease expiration triggers new election → create expired lease, verify election succeeds
  • Lease renewal extends expiry → create lease, renew, verify new expiry is ~10s from now
  • Stale leader can't renew → mark node failed, verify renewal fails
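
A sketch of the first test, using a tiny in-memory stand-in (memKV is hypothetical) that mimics the "create fails if the key exists" semantics of NATS KV:

```go
package cluster_test

import (
	"errors"
	"testing"
)

// memKV is a minimal test double with atomic-create semantics.
type memKV struct{ data map[string][]byte }

func newMemKV() *memKV { return &memKV{data: map[string][]byte{}} }

func (kv *memKV) Create(key string, value []byte) (uint64, error) {
	if _, exists := kv.data[key]; exists {
		return 0, errors.New("key exists")
	}
	kv.data[key] = value
	return 1, nil
}

func TestOnlyOneNodeCanClaimLeaderKey(t *testing.T) {
	kv := newMemKV()
	if _, err := kv.Create("leader", []byte("node-1")); err != nil {
		t.Fatalf("first claim should succeed: %v", err)
	}
	if _, err := kv.Create("leader", []byte("node-2")); err == nil {
		t.Fatal("second claim must fail while the key exists")
	}
}
```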

Cluster topology invariant tests:

  • NodeJoined adds to hashRing → call addNode, verify GetNode routes consistently
  • NodeFailed triggers rebalance → call markNodeFailed, verify rebalance triggered
  • Shard coverage validated → rebalance with 100 nodes, 1024 shards, verify all shards assigned
  • Only healthy nodes get shards → assign to failed node, verify rejected

ShardManager invariant tests:

  • AssignShards validates node health → assign to failed node, verify error
  • RebalanceShards covers all shards → simulate topology change, verify no orphans
  • Virtual nodes minimize reshuffling → add node, verify < 1/N shards move

Integration Tests

Single leader election:

  • Create 3 cluster nodes
  • Verify exactly one becomes leader
  • Stop leader
  • Verify new leader elected within 10s
  • Test: leadership term increments

Node failure and recovery:

  • Create 5-node cluster with 100 shards
  • Mark node-2 failed
  • Verify shards reassigned from node-2 to others
  • Verify node-3 doesn't become unreasonably overloaded
  • Restart node-2
  • Verify shards rebalanced back

Graceful shutdown:

  • Create 3-node cluster
  • Gracefully stop node-1 (announces NodeLeft)
  • Verify no 90s timeout; rebalancing happens immediately
  • Compare to failure case (90s delay)

Split-brain recovery:

  • Create 3-node cluster: [A(leader), B, C]
  • Partition network: A isolated, B+C connected
  • Verify A loses leadership after 10s
  • Verify B or C becomes leader
  • Heal partition
  • Verify single leader, no conflicts (A didn't try to be leader again)

Rebalancing under load:

  • Create 5-node cluster, 100 shards, with actors running
  • Add node-6
  • Verify actors migrated off other nodes to node-6
  • No actors are orphaned (all still reachable)
  • Measure: reshuffled < 1/5 of shards

Chaos Testing

  • Leader failure mid-rebalance → verify rebalancing resumed by new leader
  • Network partition (leader isolated) → verify quorum (or lease) ensures no split-brain
  • Cascading failures → 5 nodes, fail 3 at once, verify cluster stabilizes
  • High churn → nodes join/leave rapidly, verify topology converges

Boundary Conditions and Limitations

Design Decisions

Why lease-based election instead of Raft?

  • Simpler to implement and reason about
  • Detect failure in 10s (acceptable for coordination)
  • Risk: split-brain if a network partition persists > 10s and both partitions have nodes (mitigation: lease renewal goes through NATS KV, so only the partition that retains the NATS connection can hold the lease)

Why leader-only rebalancing?

  • Prevent cascading rebalancing decisions
  • Single source of truth (leader decides topology)
  • Risk: leader bottleneck if rebalancing is expensive (mitigation: the leader can delegate the computation to a placement strategy instead of computing assignments itself)

Why consistent hashing instead of load-balancing?

  • Minimize shard movement on topology change (good for actor locality)
  • Deterministic without central state (nodes can independently compute assignments)
  • Risk: load imbalance if actors heavily skewed (mitigation: application can use custom PlacementStrategy)

Why 90s failure detection timeout?

  • 3 heartbeats missed (30s * 3) before declaring failure
  • Allows for some network jitter without false positives
  • Risk: slow failure detection (mitigation: application can force MarkNodeFailed if it detects failure faster)

Assumptions

  • NATS cluster is available: If NATS is down, cluster can't communicate (no failover without NATS)
  • Clocks are reasonably synchronized: Lease expiration depends on wall clock; major clock skew can break election
  • Network partitions are rare: Split-brain only possible if partition > 10s and leader isolated
  • Rebalancing is not time-critical: 5-min periodic check is default; no SLA on shard assignment latency

Known Gaps

  1. No quorum-based election: Single leader with lease; could add quorum for stronger consistency (Raft-like)
  2. No actor migration semantics: Who actually moves actors? Cluster signals ShardMigrated, but application must handle
  3. No topology versioning: ShardMap has version, but no way to detect if a node has an outdated topology
  4. No leader handoff during rebalancing: If leader fails mid-rebalance, new leader might redo already-started migrations
  5. No split-brain detection: Cluster can't detect if two leaders somehow exist (NATS KV prevents it, but cluster doesn't enforce it)

Alignment with Product Vision

Primitives Over Frameworks:

  • Cluster Coordination provides primitives (leader election, shard assignment), not a complete framework
  • Application owns actor migration strategy (via ShardManager PlacementStrategy)
  • Application owns failure response (can custom-implement node monitoring)

NATS-Native:

  • Leader election uses NATS KV for atomic operations
  • Node discovery uses NATS pub/sub for heartbeats
  • Shard topology can be published via NATS events

Event-Sourced:

  • All topology changes produce events (NodeJoined, NodeFailed, ShardAssigned, ShardMigrated)
  • Events enable audit trail and replay (who owns which shard when?)

Resource Conscious:

  • Minimal overhead: consistent hashing avoids per-node state explosion
  • Lease-based election lighter than Raft (no log replication)
  • Virtual nodes (150 per node) keep the hash ring's memory footprint small on modest hardware

References

  • Lease-based election: Inspired by Chubby, Google's lock service
  • Consistent hashing: Karger et al., "Consistent Hashing and Random Trees"
  • Virtual nodes: Reduces reshuffling on topology change (Dynamo, Cassandra pattern)
  • NATS KV: Used for atomicity; alternatives: etcd, Consul (but less NATS-native)