Files
aether/.product-strategy/cluster/ARCHITECTURE.md
Hugo Nijhuis 271f5db444
Some checks failed
CI / build (push) Successful in 21s
CI / integration (push) Failing after 2m1s
Move product strategy documentation to .product-strategy directory
Organize all product strategy and domain modeling documentation into a
dedicated .product-strategy directory for better separation from code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-12 23:57:20 +01:00

36 KiB

Cluster Coordination: Architecture Reference

High-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Aether Cluster Runtime                     │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ DistributedVM (Orchestrator - not an aggregate)      │   │
│  │ ├─ Local Runtime (executes actors)                   │   │
│  │ ├─ NodeDiscovery (heartbeat → cluster awareness)    │   │
│  │ ├─ ClusterManager (Cluster aggregate root)          │   │
│  │ │  ├─ nodes: Map[ID → NodeInfo]                    │   │
│  │ │  ├─ shardMap: ShardMap (current assignments)     │   │
│  │ │  ├─ hashRing: ConsistentHashRing (util)          │   │
│  │ │  └─ election: LeaderElection                      │   │
│  │ └─ ShardManager (ShardAssignment aggregate)         │   │
│  │    ├─ shardCount: int                              │   │
│  │    ├─ shardMap: ShardMap                           │   │
│  │    └─ placement: PlacementStrategy                  │   │
│  └──────────────────────────────────────────────────────┘   │
│                          │ NATS                              │
│                          ▼                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ NATS Cluster                                         │   │
│  │ ├─ Subject: aether.discovery (heartbeats)           │   │
│  │ ├─ Subject: aether.cluster.* (messages)             │   │
│  │ └─ KeyValue: aether-leader-election (lease)         │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                               │
└─────────────────────────────────────────────────────────────┘

Aggregate Boundaries

Aggregate 1: Cluster (Root)

Owns node topology, shard assignments, and rebalancing decisions.

Cluster Aggregate
├─ Entities
│  └─ Cluster (root)
│     ├─ nodes: Map[NodeID → NodeInfo]
│     ├─ shardMap: ShardMap
│     ├─ hashRing: ConsistentHashRing
│     └─ currentLeaderID: string
│
├─ Commands
│  ├─ JoinCluster(nodeInfo)
│  ├─ MarkNodeFailed(nodeID)
│  ├─ AssignShards(shardMap)
│  └─ RebalanceShards(reason)
│
├─ Events
│  ├─ NodeJoined
│  ├─ NodeFailed
│  ├─ NodeLeft
│  ├─ ShardAssigned
│  ├─ ShardMigrated
│  └─ RebalancingTriggered
│
├─ Invariants Enforced
│  ├─ I2: All active shards have owners
│  ├─ I3: Shards only on healthy nodes
│  └─ I4: Assignments stable during lease
│
└─ Code Location: ClusterManager (cluster/manager.go)

Aggregate 2: LeadershipLease (Root)

Owns leadership claim and ensures single leader per term.

LeadershipLease Aggregate
├─ Entities
│  └─ LeadershipLease (root)
│     ├─ leaderID: string
│     ├─ term: uint64
│     ├─ expiresAt: time.Time
│     └─ startedAt: time.Time
│
├─ Commands
│  ├─ ElectLeader(nodeID)
│  └─ RenewLeadership(nodeID)
│
├─ Events
│  ├─ LeaderElected
│  ├─ LeadershipRenewed
│  └─ LeadershipLost
│
├─ Invariants Enforced
│  ├─ I1: Single leader per term
│  └─ I5: Leader is active node
│
└─ Code Location: LeaderElection (cluster/leader.go)

Aggregate 3: ShardAssignment (Root)

Owns shard-to-node mappings and validates assignments.

ShardAssignment Aggregate
├─ Entities
│  └─ ShardAssignment (root)
│     ├─ version: uint64
│     ├─ assignments: Map[ShardID → []NodeID]
│     ├─ nodes: Map[NodeID → NodeInfo]
│     └─ updateTime: time.Time
│
├─ Commands
│  ├─ AssignShard(shardID, nodeList)
│  └─ RebalanceFromTopology(nodes)
│
├─ Events
│  ├─ ShardAssigned
│  └─ ShardMigrated
│
├─ Invariants Enforced
│  ├─ I2: All shards assigned
│  └─ I3: Only healthy nodes
│
└─ Code Location: ShardManager (cluster/shard.go)

Command Flow Diagrams

Scenario 1: Node Joins Cluster

┌─────────┐ NodeJoined   ┌──────────────────┐
│New Node │─────────────▶│ClusterManager    │
└─────────┘              │.JoinCluster()    │
                         └────────┬─────────┘
                                  │
                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
              ┌──────────┐  ┌──────────┐  ┌──────────┐
              │Validate  │  │Update    │  │Publish   │
              │ID unique │  │topology  │  │NodeJoined│
              │Capacity>0   hashRing  │  │event     │
              └────┬─────┘  └──────────┘  └──────────┘
                   │
              ┌────▼────────────────────────┐
              │Is this node leader?         │
              │If yes: trigger rebalance    │
              └─────────────────────────────┘
                        │
            ┌───────────┴───────────┐
            ▼                       ▼
    ┌──────────────────┐   ┌──────────────────┐
    │RebalanceShards   │   │(nothing)         │
    │   command        │   │                  │
    └──────────────────┘   └──────────────────┘
            │
            ▼
    ┌──────────────────────────────────┐
    │ConsistentHashPlacement           │
    │ .RebalanceShards()               │
    │ (compute new assignments)        │
    └────────────┬─────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────────┐
    │ShardManager.AssignShards()       │
    │ (validate & apply)               │
    └────────────┬─────────────────────┘
                 │
            ┌────┴──────────────────────┐
            ▼                           ▼
    ┌──────────────┐           ┌─────────────────┐
    │For each      │           │Publish          │
    │shard moved   │           │ShardMigrated    │
    │             │           │event per shard  │
    └──────────────┘           └─────────────────┘

Scenario 2: Node Failure Detected

┌──────────────────────┐
│Heartbeat timeout     │
│(LastSeen > 90s)      │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────────────┐
│ClusterManager                │
│.MarkNodeFailed()             │
│ ├─ Mark status=Failed        │
│ ├─ Remove from hashRing      │
│ └─ Publish NodeFailed event  │
└────────────┬─────────────────┘
             │
    ┌────────▼────────────┐
    │Is this node leader? │
    └────────┬────────────┘
             │
    ┌────────┴─────────────────┐
    │ YES                      │ NO
    ▼                          ▼
┌──────────────┐   ┌──────────────────┐
│Trigger       │   │(nothing)         │
│Rebalance     │   │                  │
└──────────────┘   └──────────────────┘
    │
    └─▶ [Same as Scenario 1 from RebalanceShards]

Scenario 3: Leader Election (Implicit)

┌─────────────────────────────────────┐
│All nodes: electionLoop runs every 2s│
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────┐
    │Am I leader?         │
    └────────┬────────────┘
             │
    ┌────────┴──────────────────────────┐
    │ YES                               │ NO
    ▼                                   ▼
┌──────────────┐          ┌─────────────────────────┐
│Do nothing    │          │Should try election?     │
│(already      │          │ ├─ No leader exists     │
│leading)      │          │ ├─ Lease expired        │
└──────────────┘          │ └─ (other conditions)   │
                          └────────┬────────────────┘
                                   │
                         ┌─────────▼──────────┐
                         │try AtomicCreate    │
                         │"leader" key in KV  │
                         └────────┬───────────┘
                                  │
                    ┌─────────────┴──────────────┐
                    │ SUCCESS                    │ FAILED
                    ▼                            ▼
            ┌──────────────────┐      ┌──────────────────┐
            │Became Leader!    │      │Try claim expired │
            │Publish           │      │lease; if success,│
            │LeaderElected     │      │become leader     │
            └──────────────────┘      │Else: stay on     │
                    │                 │bench              │
                    ▼                 └──────────────────┘
            ┌──────────────────┐
            │Start lease       │
            │renewal loop      │
            │(every 3s)        │
            └──────────────────┘

Decision Trees

Decision 1: Is Node Healthy?

Query: Is Node X healthy?

┌─────────────────────────────────────┐
│Get node status from Cluster.nodes   │
└────────────┬────────────────────────┘
             │
    ┌────────▼────────────────┐
    │Check node.Status field  │
    └────────┬────────────────┘
             │
    ┌────────┴───────────────┬─────────────────────┬──────────────┐
    │                        │                     │              │
    ▼                        ▼                     ▼              ▼
┌────────┐          ┌────────────┐      ┌──────────────┐  ┌─────────┐
│Active  │          │Draining    │      │Failed        │  │Unknown  │
├────────┤          ├────────────┤      ├──────────────┤  ├─────────┤
│✓Healthy│          │⚠ Draining  │      │✗ Unhealthy   │  │✗ Error  │
│Can host│          │Should not  │      │Don't use for │  │         │
│shards  │          │get new     │      │sharding      │  │         │
└────────┘          │shards, but │      │Delete shards │  └─────────┘
                    │existing OK │      │ASAP          │
                    └────────────┘      └──────────────┘

Decision 2: Should This Node Rebalance Shards?

Command: RebalanceShards(nodeID, reason)

┌──────────────────────────────────┐
│Is nodeID the current leader?     │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐      ┌──────────────────────┐
│Continue    │      │REJECT: NotLeader     │
│rebalancing │      │                      │
└─────┬──────┘      │Only leader can       │
      │             │initiate rebalancing  │
      │             └──────────────────────┘
      │
      ▼
┌─────────────────────────────────────┐
│Are there active nodes?              │
└────────┬────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐      ┌──────────────────────┐
│Continue    │      │REJECT: NoActiveNodes │
│rebalancing │      │                      │
└─────┬──────┘      │Can't assign shards   │
      │             │with no healthy nodes │
      │             └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────────┐
│Call PlacementStrategy.RebalanceShards()  │
│ (compute new assignments)                │
└────────┬─────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────┐
│Call ShardManager.AssignShards()          │
│ (validate & apply new assignments)       │
└────────┬─────────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ SUCCESS               │ FAILURE
    ▼                       ▼
┌────────────┐      ┌──────────────────────┐
│Publish     │      │Publish               │
│Shard       │      │RebalancingFailed     │
│Migrated    │      │event                 │
│events      │      │                      │
│            │      │Log error, backoff    │
│Publish     │      │try again in 5 min    │
│Rebalancing │      │                      │
│Completed   │      │                      │
└────────────┘      └──────────────────────┘

Decision 3: Can We Assign This Shard to This Node?

Command: AssignShard(shardID, nodeID)

┌──────────────────────────────────┐
│Is nodeID in Cluster.nodes?       │
└────────┬─────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐      ┌──────────────────────┐
│Continue    │      │REJECT: NodeNotFound  │
│assignment  │      │                      │
└─────┬──────┘      │Can't assign shard    │
      │             │to non-existent node  │
      │             └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────┐
│Check node.Status                     │
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │Active or Draining     │ Failed
    ▼                       ▼
┌────────────┐      ┌──────────────────────┐
│Continue    │      │REJECT: UnhealthyNode │
│assignment  │      │                      │
└─────┬──────┘      │Can't assign to       │
      │             │failed node; it can't │
      │             │execute shards        │
      │             └──────────────────────┘
      │
      ▼
┌──────────────────────────────────────┐
│Check replication factor              │
│ (existing nodes < replication limit?)│
└────────┬─────────────────────────────┘
         │
    ┌────┴──────────────────┐
    │ YES                   │ NO
    ▼                       ▼
┌────────────┐      ┌──────────────────────┐
│ACCEPT      │      │REJECT: TooManyReplicas
│Add node to │      │                      │
│shard's     │      │Already have max      │
│replica     │      │replicas for shard    │
│list        │      │                      │
└────────────┘      └──────────────────────┘

State Transitions

Cluster State Machine

                    ┌────────────────┐
                    │ Initializing   │
                    │ (no nodes)     │
                    └────────┬───────┘
                             │ NodeJoined
                             ▼
                    ┌────────────────┐
                    │ Single Node    │
                    │ (one node only)│
                    └────────┬───────┘
                             │ NodeJoined
                             ▼
    ┌────────────────────────────────────────────┐
    │ Multi-Node Cluster                         │
    │ ├─ Stable (healthy nodes, shards assigned) │
    │ ├─ Rebalancing (shards moving)             │
    │ └─ Degraded (failed node waiting for heal) │
    └────────────┬───────────────────────────────┘
                 │ (All nodes left or failed)
                 ▼
            ┌────────────────┐
            │ No Nodes       │
            │ (cluster dead) │
            └────────────────┘

Node State Machine (per node)

    ┌────────────────┐
    │ Discovered     │
    │ (new heartbeat)│
    └────────┬───────┘
             │ JoinCluster
             ▼
    ┌────────────────┐
    │ Active         │
    │ (healthy,      │
    │  can host      │
    │  shards)       │
    └────────┬───────┘
             │
    ┌────────┴──────────────┬─────────────────┐
    │                       │                 │
    │ (graceful)            │ (heartbeat miss)│
    ▼                       ▼                 ▼
┌────────────┐      ┌────────────────┐  ┌────────────────┐
│ Draining   │      │ Failed         │  │ Failed         │
│ (stop new  │      │ (timeout:90s)  │  │ (detected)     │
│  shards,   │      │                │  │ (admin/health) │
│  preserve  │      │ Rebalance      │  │                │
│  existing) │      │ shards ASAP    │  │ Rebalance      │
│            │      │                │  │ shards ASAP    │
│            │      │Recover?        │  │                │
│            │      │  ├─ Yes:       │  │Recover?        │
│            │      │  │  → Active   │  │  ├─ Yes:       │
│            │      │  └─ No:        │  │  │  → Active   │
│            │      │     → Deleted  │  │  └─ No:        │
│            │      │                │  │     → Deleted  │
│            │      └────────────────┘  └────────────────┘
└────┬───────┘
     │ Removed
     ▼
┌────────────────┐
│ Deleted        │
│ (left cluster) │
└────────────────┘

Leadership State Machine (per node)

    ┌──────────────────┐
    │ Not a Leader     │
    │ (waiting)        │
    └────────┬─────────┘
             │ Try Election (every 2s)
             │ Atomic create "leader" succeeds
             ▼
    ┌──────────────────┐
    │ Candidate        │
    │ (won election)   │
    └────────┬─────────┘
             │ Start lease renewal loop
             ▼
    ┌──────────────────┐
    │ Leader           │
    │ (holding lease)  │
    └────────┬─────────┘
             │
      ┌──────┴───────────┬──────────────────────┐
      │                  │                      │
      │ Renew lease      │ Lease expires        │ Graceful
      │ (every 3s)       │ (90s timeout)        │ shutdown
      │ ✓ Success        │ ✗ Failure            │
      ▼                  ▼                      ▼
    [stays]         ┌──────────────────┐   ┌──────────────────┐
                    │ Lost Leadership  │   │ Lost Leadership  │
                    │ (lease expired)  │   │ (graceful)       │
                    └────────┬─────────┘   └────────┬─────────┘
                             │                      │
                             └──────────┬───────────┘
                                        │
                                        ▼
                            ┌──────────────────┐
                            │ Not a Leader     │
                            │ (back to bench)  │
                            └──────────────────┘
                                    │
                                    └─▶ [Back to top]

Concurrency Model

Thread Safety

All aggregates use sync.RWMutex for thread safety:

type ClusterManager struct {
    mutex    sync.RWMutex              // Protects access to:
    nodes    map[string]*NodeInfo      //   - nodes
    shardMap *ShardMap                 //   - shardMap
    // ...
}

// Read operation (multiple goroutines)
func (cm *ClusterManager) GetClusterTopology() map[string]*NodeInfo {
    cm.mutex.RLock()                   // Shared lock
    defer cm.mutex.RUnlock()
    // ...
}

// Write operation (exclusive)
func (cm *ClusterManager) JoinCluster(nodeInfo *NodeInfo) error {
    cm.mutex.Lock()                    // Exclusive lock
    defer cm.mutex.Unlock()
    // ... (only one writer at a time)
}

Background Goroutines

┌─────────────────────────────────────────────────┐
│ DistributedVM.Start()                           │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────────────────────────────────────┐  │
│  │ ClusterManager.Start()                   │  │
│  │ ├─ go election.Start()                   │  │
│  │ │  ├─ go electionLoop() [ticker: 2s]    │  │
│  │ │  ├─ go leaseRenewalLoop() [ticker: 3s]│  │
│  │ │  └─ go monitorLeadership() [watcher]  │  │
│  │ │                                        │  │
│  │ ├─ go monitorNodes() [ticker: 30s]      │  │
│  │ └─ go rebalanceLoop() [ticker: 5m]      │  │
│  └──────────────────────────────────────────┘  │
│                                                 │
│  ┌──────────────────────────────────────────┐  │
│  │ NodeDiscovery.Start()                    │  │
│  │ ├─ go announceNode() [ticker: 30s]       │  │
│  │ └─ Subscribe to "aether.discovery"       │  │
│  └──────────────────────────────────────────┘  │
│                                                 │
│  ┌──────────────────────────────────────────┐  │
│  │ NATS subscriptions                       │  │
│  │ ├─ "aether.cluster.*" → messages         │  │
│  │ ├─ "aether.discovery" → node updates     │  │
│  │ └─ "aether-leader-election" → KV watch   │  │
│  └──────────────────────────────────────────┘  │
│                                                 │
└─────────────────────────────────────────────────┘

Event Sequences

Event: Node Join, Rebalance, Shard Assignment

Timeline:

T=0s
  ├─ Node-4 joins cluster
  ├─ NodeDiscovery announces NodeJoined via NATS
  └─ ClusterManager receives and processes

T=0.1s
  ├─ ClusterManager.JoinCluster() executes
  ├─ Updates nodes map, hashRing
  ├─ Publishes NodeJoined event
  └─ If leader: triggers rebalancing

T=0.5s (if leader)
  ├─ ClusterManager.rebalanceLoop() fires (or triggerShardRebalancing)
  ├─ PlacementStrategy.RebalanceShards() computes new assignments
  └─ ShardManager.AssignShards() applies new assignments

T=1.0s
  ├─ Publishes ShardMigrated events (one per shard moved)
  ├─ All nodes subscribe to these events
  ├─ Each node routing table updated
  └─ Actors aware of new shard locations

T=1.5s onwards
  ├─ Actors on moved shards migrated (application layer)
  ├─ Actor Runtime subscribes to ShardMigrated
  ├─ Triggers actor migration via ActorMigration
  └─ Eventually: rebalancing complete

T=5m
  ├─ Periodic rebalance check (5m timer)
  ├─ If no changes: no-op
  └─ If imbalance detected: trigger rebalance again

Event: Node Failure Detection and Recovery

Timeline:

T=0s
  ├─ Node-2 healthy, last heartbeat received
  └─ Node-2.LastSeen = now

T=30s
  ├─ Node-2 healthcheck runs (every 30s timer)
  ├─ Publishes heartbeat
  └─ Node-2.LastSeen updated

T=60s
  ├─ (Node-2 still healthy)
  └─ Heartbeat received, LastSeen updated

T=65s
  ├─ Node-2 CRASH (network failure or process crash)
  ├─ No more heartbeats sent
  └─ Node-2.LastSeen = 60s

T=90s (timeout)
  ├─ ClusterManager.checkNodeHealth() detects timeout
  ├─ now - LastSeen > 90s → mark node Failed
  ├─ ClusterManager.MarkNodeFailed() executes
  ├─ Publishes NodeFailed event
  ├─ If leader: triggers rebalancing
  └─ (If not leader: waits for leader to rebalance)

T=91s (if leader)
  ├─ RebalanceShards triggered
  ├─ PlacementStrategy computes new topology without Node-2
  ├─ ShardManager.AssignShards() reassigns shards
  └─ Publishes ShardMigrated events

T=92s onwards
  ├─ Actors migrated from Node-2 to healthy nodes
  └─ No actor loss (assuming replication or migration succeeded)

T=120s (Node-2 recovery)
  ├─ Node-2 process restarts
  ├─ NodeDiscovery announces NodeJoined again
  ├─ Status: Active
  └─ (Back to Node Join sequence if leader decides)

Configuration & Tuning

Key Parameters

// From cluster/types.go
const (
    // LeaderLeaseTimeout: how long before leader must renew
    LeaderLeaseTimeout = 10 * time.Second

    // HeartbeatInterval: how often leader renews
    HeartbeatInterval = 3 * time.Second

    // ElectionTimeout: how often nodes try election
    ElectionTimeout = 2 * time.Second

    // Node failure detection (in manager.go)
    nodeFailureTimeout = 90 * time.Second
)

// From cluster/types.go
const (
    // DefaultNumShards: total shards in cluster
    DefaultNumShards = 1024

    // DefaultVirtualNodes: per-node virtual replicas for distribution
    DefaultVirtualNodes = 150
)

Tuning Guide

Parameter Current Rationale Trade-off
LeaderLeaseTimeout 10s Fast failure detection May cause thrashing in high-latency networks
HeartbeatInterval 3s Leader alive signal every 3s Overhead 3x per 9s window
ElectionTimeout 2s Retry elections frequently CPU cost, but quick recovery
NodeFailureTimeout 90s 3x heartbeat interval Tolerance for temp network issues
DefaultNumShards 1024 Good granularity for large clusters More shards = more metadata
DefaultVirtualNodes 150 Balance between distribution and overhead Lower = worse distribution, higher = more ring operations

Failure Scenarios & Recovery

Scenario A: Single Node Fails

Before:  [A (leader), B, C, D] with 1024 shards
         ├─ A: 256 shards (+ leader)
         ├─ B: 256 shards
         ├─ C: 256 shards
         └─ D: 256 shards

B crashes (no recovery) → waits 90s → marked Failed

After Rebalance:
         [A (leader), C, D] with 1024 shards
         ├─ A: 341 shards (+ leader)
         ├─ C: 341 shards
         └─ D: 342 shards

Recovery: Reshuffled ~1/3 shards (consistent hashing + virtual nodes minimizes this)


Scenario B: Leader Fails

Before:  [A (leader), B, C, D]

A crashes → waits 90s → marked Failed
            no lease renewal → lease expires after 10s

B, C, or D wins election → new leader
                        → triggers rebalance
                        → reshuffles A's shards

After:   [B (leader), C, D]

Recovery: New leader elected within 10s; rebalancing within 100s; no loss if replicas present


Scenario C: Network Partition

Before:  [A (leader), B, C, D]
Partition: {A} isolated | {B, C, D} connected

At T=10s (lease expires):
  ├─ A: can't reach NATS, can't renew → loses leadership
  ├─ B, C, D: A's lease expired, one wins election
  └─ New leader coordinates rebalance

Risk: If A can reach NATS (just isolated from app), might try to renew
      but atomic update fails because term mismatch

Result: Single leader maintained; no split-brain

Monitoring & Observability

Key Metrics to Track

# Cluster Topology
gauge: cluster.nodes.count                [active|draining|failed]
gauge: cluster.shards.assigned            [0, 1024]
gauge: cluster.shards.orphaned            [0, 1024]

# Leadership
gauge: cluster.leader.is_leader           [0, 1]
gauge: cluster.leader.term                [0, ∞]
gauge: cluster.leader.lease_expires_in_seconds  [0, 10]

# Rebalancing
counter: cluster.rebalancing.triggered    [reason]
gauge: cluster.rebalancing.active         [0, 1]
counter: cluster.rebalancing.completed    [shards_moved]
counter: cluster.rebalancing.failed       [reason]

# Node Health
gauge: cluster.node.heartbeat_latency_ms  [per node]
gauge: cluster.node.load                  [per node]
gauge: cluster.node.vm_count              [per node]
counter: cluster.node.failures            [reason]

Alerts

- Leader heartbeat missing > 5s → election may be stuck
- Rebalancing in progress > 5min → something wrong
- Orphaned shards > 0 → invariant violation
- Node failure > 50% of cluster → investigate

References