# Aether Product Capabilities

This document maps Aether's domain models to product capabilities, defined here as "the system's ability to cause meaningful domain changes." These capabilities bridge domain models to product value.

## Summary

Aether provides nine capabilities across five bounded contexts. They enable teams building distributed, event-sourced systems in Go to:

- Store events durably with automatic conflict detection for safe concurrent writes
- Rebuild application state from immutable event history
- Isolate logical domains using namespace boundaries without architectural complexity
- Coordinate distributed clusters with automatic leader election and shard rebalancing
- Route domain events across nodes with flexible filtering and NATS-native delivery

## Capabilities

### Core Capabilities

#### Capability 1: Store Events Durably with Conflict Detection

**Bounded Context:** Event Sourcing

**Description:** The system can persist domain events as the source of truth, preventing lost writes through monotonic version enforcement and detecting concurrent modifications before data corruption occurs.

**Domain Support:**

- **Context:** Event Sourcing
- **Aggregate:** ActorEventStream (implicit - each actor has an event stream)
- **Commands:** `SaveEvent(event)`, `GetLatestVersion(actorID)`
- **Events:** `EventStored`, `VersionConflictDetected`
- **Invariant:** Versions increase monotonically per actor; a write at or below the current version is rejected

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/event.go` - Event type, VersionConflictError, EventStore interface
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - JetStreamEventStore implements SaveEvent with version validation
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/memory.go` - InMemoryEventStore for testing

**Business Value:**

- **Pain eliminated:** Developers no longer fear concurrent writes corrupting state
- **Job enabled:** Applications can safely update actors from multiple writers (no locks, no deadlocks)
- **Outcome:** Events form an immutable, append-only history; conflicts fail fast
- **Beneficiary:** Go teams building distributed systems

**Success Conditions:**

1. Multiple writers can attempt to update the same actor simultaneously
2. At most one writer succeeds; the others receive ErrVersionConflict
3. Failed writers can inspect CurrentVersion and retry with the next version
4. No events are lost or overwritten
5. Version conflicts are detected in <1ms (optimistic locking, not pessimistic)
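
To make the invariant concrete, here is a minimal store-side sketch of the version check, assuming types shaped like the Event and VersionConflictError described above (names and fields are illustrative, not Aether's exact API):

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative types approximating the documented Event and VersionConflictError.
type Event struct {
	ActorID string
	Version int64
	Type    string
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// memoryStore enforces the invariant: a write at or below the current version is rejected.
type memoryStore struct {
	mu     sync.Mutex
	latest map[string]int64 // highest version stored per actor
	events map[string][]Event
}

func (s *memoryStore) SaveEvent(ev Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current := s.latest[ev.ActorID]
	if ev.Version <= current {
		return &VersionConflictError{ActorID: ev.ActorID, AttemptedVersion: ev.Version, CurrentVersion: current}
	}
	s.events[ev.ActorID] = append(s.events[ev.ActorID], ev)
	s.latest[ev.ActorID] = ev.Version
	return nil
}

func main() {
	store := &memoryStore{latest: map[string]int64{}, events: map[string][]Event{}}
	fmt.Println(store.SaveEvent(Event{ActorID: "order-1", Version: 1, Type: "OrderPlaced"}))  // <nil>
	fmt.Println(store.SaveEvent(Event{ActorID: "order-1", Version: 1, Type: "OrderShipped"})) // conflict: current version is 1
}
```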
---

#### Capability 2: Rebuild State from Event History

**Bounded Context:** Event Sourcing

**Description:** The system can derive any past or present application state by replaying events from a starting version forward. Snapshots optimize replay for long-lived actors.

**Domain Support:**

- **Context:** Event Sourcing
- **Aggregate:** ActorEventStream
- **Commands:** `GetEvents(actorID, fromVersion)`, `GetLatestSnapshot(actorID)`, `SaveSnapshot(snapshot)`
- **Events:** `ReplayStarted`, `ReplayCompleted`, `SnapshotCreated`
- **Invariant:** Event history is immutable; replaying the same events always produces the same state

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/event.go` - EventStore.GetEvents, SnapshotStore interface, ActorSnapshot type
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - Implements GetEvents with optional snapshots

**Business Value:**

- **Pain eliminated:** No need for separate read models; state can be reconstructed on demand
- **Job enabled:** Debugging "how did we get here?", rebuilding state after corruption, temporal queries
- **Outcome:** Complete audit trail; state at any point in time is reproducible
- **Beneficiary:** Platform builders (Flowmade teams), consultancies auditing systems

**Success Conditions:**

1. `GetEvents(actorID, 0)` returns all events in order
2. Replaying all events produces an identical state every time
3. A snapshot reduces replay to only the events recorded after it, rather than the full history
4. Snapshots are optional; the system works without them
5. Corrupted events are reported (ReplayError) without losing clean data
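
A minimal sketch of deterministic replay with an optional snapshot, using illustrative types rather than Aether's actual Event and ActorSnapshot definitions:

```go
package main

import "fmt"

// Illustrative types only; Aether's Event and ActorSnapshot carry more fields.
type Event struct {
	ActorID string
	Version int64
	Type    string
	Amount  int
}

type Snapshot struct {
	ActorID string
	Version int64
	Balance int
}

// rebuild folds events into state, starting from a snapshot when one exists.
// Replaying the same inputs always yields the same state (determinism).
func rebuild(snap *Snapshot, events []Event) (balance int, version int64) {
	if snap != nil {
		balance, version = snap.Balance, snap.Version
	}
	for _, ev := range events {
		if ev.Version <= version {
			continue // already reflected in the snapshot
		}
		switch ev.Type {
		case "Deposited":
			balance += ev.Amount
		case "Withdrawn":
			balance -= ev.Amount
		}
		version = ev.Version
	}
	return balance, version
}

func main() {
	history := []Event{
		{ActorID: "acct-1", Version: 1, Type: "Deposited", Amount: 100},
		{ActorID: "acct-1", Version: 2, Type: "Withdrawn", Amount: 30},
		{ActorID: "acct-1", Version: 3, Type: "Deposited", Amount: 5},
	}
	snap := &Snapshot{ActorID: "acct-1", Version: 2, Balance: 70}

	full, _ := rebuild(nil, history)      // replay the full history
	fast, _ := rebuild(snap, history[2:]) // replay only events after the snapshot
	fmt.Println(full, fast)               // 75 75 - identical state either way
}
```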
---

#### Capability 3: Enable Safe Concurrent Writes

**Bounded Context:** Optimistic Concurrency Control

**Description:** Multiple concurrent writers can update the same actor without locks. Conflicts are detected immediately; the application controls the retry strategy.

**Domain Support:**

- **Context:** Optimistic Concurrency Control (enabled by the Event Sourcing capability)
- **Aggregate:** ActorEventStream
- **Commands:** `ReadVersion(actorID)`, `AttemptWrite(event)` [implicit in SaveEvent]
- **Events:** `WriteSucceeded`, `WriteFailed` (as VersionConflictError)
- **Invariant:** If two writes race, exactly one wins; the other sees the conflict

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/event.go` - VersionConflictError type with CurrentVersion details
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - Version validation in SaveEvent

**Business Value:**

- **Pain eliminated:** No need for pessimistic locking (locks, deadlocks, performance cliffs)
- **Job enabled:** High-concurrency writes (e.g., multi-user edits, distributed aggregates)
- **Outcome:** The application has visibility into conflicts and can implement backoff, circuit-breaking, or merge strategies
- **Beneficiary:** Go teams building collaborative or distributed systems

**Success Conditions:**

1. Two concurrent writers read the same current version (1) for the same actor
2. The first SaveEvent(version: 2) succeeds
3. The second SaveEvent(version: 2) receives VersionConflictError with CurrentVersion=2
4. The application can call GetLatestVersion again and retry with version 3
5. No database-level locks are held at any point (optimistic, not pessimistic)
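
A sketch of the writer-side retry loop implied by these conditions, against an assumed store contract mirroring the GetLatestVersion and SaveEvent commands above (the signatures are guesses for illustration, not Aether's API):

```go
package example

import (
	"errors"
	"fmt"
)

// Store is an illustrative contract; Aether's EventStore interface may differ.
type Store interface {
	GetLatestVersion(actorID string) (int64, error)
	SaveEvent(actorID string, version int64, eventType string, data []byte) error
}

type VersionConflictError struct{ CurrentVersion int64 }

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict: current version is %d", e.CurrentVersion)
}

// saveWithRetry implements the read-attempt-retry loop: read the latest
// version, try to append at version+1, and on conflict re-read and retry.
func saveWithRetry(s Store, actorID, eventType string, data []byte, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		version, err := s.GetLatestVersion(actorID)
		if err != nil {
			return err
		}
		err = s.SaveEvent(actorID, version+1, eventType, data)
		if err == nil {
			return nil
		}
		var conflict *VersionConflictError
		if errors.As(err, &conflict) {
			continue // another writer won the race; re-read and retry
		}
		return err // non-conflict errors are not retried
	}
	return fmt.Errorf("gave up after %d conflicting attempts", maxAttempts)
}
```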
---

#### Capability 4: Isolate Logical Domains Using Namespaces

**Bounded Context:** Namespace Isolation

**Description:** Events in one namespace are completely invisible to queries, subscriptions, and storage of another namespace. Namespaces enable logical boundaries without architectural complexity.

**Domain Support:**

- **Context:** Namespace Isolation
- **Concepts:** Namespace (value object, not aggregate)
- **Commands:** `PublishToNamespace(namespace, event)`, `SubscribeToNamespace(namespace)`, `GetEventsInNamespace(namespace, actorID)`
- **Events:** Events carry namespace context
- **Invariant:** Events stored with namespace X cannot be retrieved from namespace Y

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - JetStreamConfig.Namespace, stream name becomes "{namespace}_{streamName}"
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/eventbus.go` - Subscribe(namespacePattern), Publish(namespaceID)
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/nats_eventbus.go` - NATS subject routing with namespace isolation

**Business Value:**

- **Pain eliminated:** Multi-tenant or multi-domain systems don't need complex isolation logic in application code
- **Job enabled:** Separate bounded contexts can coexist on the same cluster without leaking events
- **Outcome:** Storage-level isolation ensures data cannot leak between namespaces
- **Beneficiary:** Platform builders (Flowmade), SaaS products using Aether

**Success Conditions:**

1. Create two stores with different namespaces: "tenant-a", "tenant-b"
2. SaveEvent to the "tenant-a" stream
3. GetEvents from "tenant-a" returns the event
4. GetEvents from "tenant-b" returns empty
5. Stream names are prefixed: "tenant-a_events", "tenant-b_events"
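
A small sketch of the naming scheme that produces this isolation, based on the "{namespace}_{streamName}" stream prefix and "aether.events.{namespace}" subject format noted in the artifacts; the Config type and function names here are illustrative:

```go
package main

import "fmt"

// Config mirrors the JetStreamConfig.Namespace behaviour described above:
// the effective stream name is "{namespace}_{streamName}".
type Config struct {
	Namespace  string
	StreamName string
}

func (c Config) effectiveStream() string {
	if c.Namespace == "" {
		return c.StreamName
	}
	return c.Namespace + "_" + c.StreamName
}

// A namespace-scoped subject keeps pub/sub isolated the same way.
func subject(namespace string) string {
	return "aether.events." + namespace
}

func main() {
	a := Config{Namespace: "tenant-a", StreamName: "events"}
	b := Config{Namespace: "tenant-b", StreamName: "events"}
	fmt.Println(a.effectiveStream(), b.effectiveStream()) // tenant-a_events tenant-b_events
	fmt.Println(subject("tenant-a"), subject("tenant-b")) // aether.events.tenant-a aether.events.tenant-b
}
```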
---

#### Capability 5: Coordinate Cluster Topology

**Bounded Context:** Cluster Coordination

**Description:** The cluster automatically discovers nodes, elects a leader, and maintains a consistent view of which nodes are alive. Failed nodes are detected and marked unavailable.

**Domain Support:**

- **Context:** Cluster Coordination
- **Aggregates:** Cluster (group of nodes), LeadershipLease (time-bound authority)
- **Commands:** `JoinCluster()`, `ElectLeader()`, `MarkNodeFailed(nodeID)`, `PublishHeartbeat()`
- **Events:** `NodeJoined`, `NodeLeft`, `LeaderElected`, `LeadershipExpired`, `NodeFailed`
- **Invariants:**
  - At most one leader at any time
  - The leader's lease expires and triggers re-election if its holder dies
  - All nodes converge on the same view of alive nodes

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/manager.go` - ClusterManager.Start() begins discovery
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/leader.go` - LeaderElection with heartbeats and re-election
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/discovery.go` - NodeDiscovery watches NATS for node announcements
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/types.go` - NodeInfo, LeadershipLease, NodeStatus

**Business Value:**

- **Pain eliminated:** No need to manually manage cluster topology or leader election
- **Job enabled:** The cluster heals itself on node failures; new nodes join automatically
- **Outcome:** Single source of truth for cluster state; the leader can coordinate rebalancing
- **Beneficiary:** Infrastructure teams (Flowmade), anyone deploying Aether on multiple nodes

**Success Conditions:**

1. Three nodes start; after heartbeats, one is elected leader
2. The leader's lease renews regularly; all nodes see a consistent LeaderID
3. The leader stops sending heartbeats; its lease expires
4. Remaining nodes elect a new leader within 2x HeartbeatInterval
5. A rejoining node detects it is behind and syncs cluster state
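
A single-process sketch of the lease rule behind these conditions: a node may claim leadership only when no unexpired lease exists. Aether's LeaderElection coordinates the equivalent over NATS; the names and policy here are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// LeadershipLease is a time-bound claim to leadership, as described above.
type LeadershipLease struct {
	LeaderID  string
	ExpiresAt time.Time
}

func (l *LeadershipLease) expired(now time.Time) bool {
	return l == nil || now.After(l.ExpiresAt)
}

// tryAcquire renews the lease if the caller already leads, or claims it if expired.
func tryAcquire(current *LeadershipLease, nodeID string, now time.Time, ttl time.Duration) (*LeadershipLease, bool) {
	if current.expired(now) || current.LeaderID == nodeID {
		return &LeadershipLease{LeaderID: nodeID, ExpiresAt: now.Add(ttl)}, true
	}
	return current, false
}

func main() {
	ttl := 2 * time.Second
	now := time.Now()

	lease, ok := tryAcquire(nil, "node-a", now, ttl)
	fmt.Println(lease.LeaderID, ok) // node-a true

	_, ok = tryAcquire(lease, "node-b", now.Add(time.Second), ttl)
	fmt.Println(ok) // false - node-a's lease is still valid

	// node-a stops renewing; after the lease expires, node-b wins re-election.
	lease, ok = tryAcquire(lease, "node-b", now.Add(3*time.Second), ttl)
	fmt.Println(lease.LeaderID, ok) // node-b true
}
```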
---

#### Capability 6: Distribute Actors Across Cluster Nodes

**Bounded Context:** Cluster Coordination

**Description:** Actors hash to shards; shards map to nodes using consistent hashing. Actor requests are routed to the shard owner. Consistent hashing keeps reshuffling minimal when the topology changes.

**Domain Support:**

- **Context:** Cluster Coordination
- **Aggregates:** ShardMap (authoritative mapping of shards to nodes), ShardAssignment
- **Commands:** `AssignShards(nodeID, shardIDs)`, `RebalanceShards(fromNode, toNode, shardIDs)`
- **Events:** `ShardAssigned`, `ShardMigrated`, `RebalanceStarted`
- **Invariants:**
  - Each shard is owned by exactly one node
  - An ActorID always hashes to the same shard
  - Consistent hashing minimizes reshuffling on node add/remove

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/hashring.go` - ConsistentHashRing implements consistent hashing
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/shard.go` - ShardManager tracks shard ownership
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/types.go` - ShardMap, NodeInfo.ShardIDs

**Business Value:**

- **Pain eliminated:** No need for an external shard registry or manual shard assignment
- **Job enabled:** Transparent actor distribution; requests route to the correct node automatically
- **Outcome:** Load spreads evenly; adding nodes doesn't require full reshuffling
- **Beneficiary:** Distributed system builders

**Success Conditions:**

1. Three nodes; 100 shards distributed evenly (33-33-34)
2. Actor "order-123" hashes to shard 42 consistently
3. Shard 42 is owned by node-b; the request is routed to node-b
4. Adding node-d rebalances ~25 shards; the rest stay put (minimal reshuffling)
5. Removing node-a redistributes its shards among the remaining nodes
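
A sketch of the two lookups involved in routing, actor to shard and shard to node. Aether's ConsistentHashRing uses a proper ring with virtual nodes; the modulo mapping below is only illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 100

// shardFor hashes an actor ID to a stable shard number.
func shardFor(actorID string) int {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return int(h.Sum32() % numShards)
}

func main() {
	// shardOwners would normally come from the leader-maintained ShardMap.
	shardOwners := map[int]string{}
	nodes := []string{"node-a", "node-b", "node-c"}
	for shard := 0; shard < numShards; shard++ {
		shardOwners[shard] = nodes[shard%len(nodes)]
	}

	shard := shardFor("order-123")
	fmt.Printf("actor order-123 -> shard %d -> %s\n", shard, shardOwners[shard])

	// The same actor ID always maps to the same shard.
	fmt.Println(shardFor("order-123") == shard) // true
}
```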
---

#### Capability 7: Recover from Node Failures

**Bounded Context:** Cluster Coordination

**Description:** When a node fails, its shards are automatically reassigned to healthy nodes. Actors replay from JetStream on the new node. The cluster remains available.

**Domain Support:**

- **Context:** Cluster Coordination
- **Aggregates:** Cluster, ShardAssignment
- **Commands:** `MarkNodeFailed(nodeID)`, `RebalanceShards(failedNode)`
- **Events:** `NodeFailed`, `ShardMigrated`, `ActorReplayed`
- **Invariants:**
  - A failed node's shards are claimed by healthy nodes within FailureDetectionTimeout
  - No actor is lost from the cluster permanently

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/manager.go` - monitorNodes() detects heartbeat failures
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/leader.go` - Leader initiates rebalancing
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - Actors replay state from JetStream on the new node

**Business Value:**

- **Pain eliminated:** No manual intervention needed to recover from node failure
- **Job enabled:** The cluster stays online despite single-node failures
- **Outcome:** Recovery time (RTO) is bounded by rebalancing plus replay time
- **Beneficiary:** Production systems requiring high availability

**Success Conditions:**

1. Node-a holds shards [1, 2, 3], then dies
2. The leader detects the failure (heartbeat timeout)
3. Shards [1, 2, 3] are reassigned to healthy nodes within 5 seconds
4. Actors in those shards replay from JetStream on their new homes
5. New requests reach the actors on their new nodes; no data is lost
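
A sketch of the rebalancing step the leader performs when a node fails; the round-robin reassignment policy is an assumption for illustration, not necessarily Aether's strategy.

```go
package main

import "fmt"

// reassign moves every shard owned by the failed node to one of the
// remaining healthy nodes.
func reassign(owners map[int]string, failed string, healthy []string) {
	i := 0
	for shard, node := range owners {
		if node != failed {
			continue
		}
		owners[shard] = healthy[i%len(healthy)]
		i++
	}
}

func main() {
	owners := map[int]string{1: "node-a", 2: "node-a", 3: "node-a", 4: "node-b", 5: "node-c"}
	reassign(owners, "node-a", []string{"node-b", "node-c"})
	fmt.Println(owners) // shards 1-3 now owned by node-b/node-c; 4 and 5 unchanged
}
```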
---

#### Capability 8: Route and Filter Domain Events

**Bounded Context:** Event Bus

**Description:** Events published to a namespace are delivered to all subscribers of that namespace (or matching patterns). Subscribers can filter by event type or actor pattern.

**Domain Support:**

- **Context:** Event Bus
- **Aggregate:** EventBus (local pub/sub coordination)
- **Commands:** `Publish(namespace, event)`, `Subscribe(namespacePattern)`, `SubscribeWithFilter(namespacePattern, filter)`
- **Events:** `EventPublished`, `SubscriptionCreated`
- **Invariants:**
  - All subscribers of a namespace receive all events (before filters)
  - Filters are applied client-side; subscribers get only matching events
  - Exact subscriptions are isolated from wildcard subscriptions

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/eventbus.go` - EventBus.Publish, Subscribe, SubscribeWithFilter
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/pattern.go` - SubscriptionFilter, namespace pattern matching

**Business Value:**

- **Pain eliminated:** No need to manually route events; pub/sub happens automatically
- **Job enabled:** Loose coupling; event producers don't know about consumers
- **Outcome:** New subscribers can join/leave without touching publishers
- **Beneficiary:** Domain-driven architects building loosely coupled systems

**Success Conditions:**

1. Publish event to "orders" namespace
2. Exact subscriber of "orders" receives event
3. Wildcard subscriber of "order*" receives event
4. Subscriber with filter `{EventTypes: ["OrderPlaced"]}` receives event only if EventType="OrderPlaced"
5. Subscriber with actor pattern "order-customer-123" receives event only if ActorID matches
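
A sketch of client-side filtering consistent with these conditions; the SubscriptionFilter fields and wildcard rule are assumptions modeled on the description above, not Aether's pattern.go implementation.

```go
package main

import (
	"fmt"
	"strings"
)

type Event struct {
	Namespace string
	ActorID   string
	EventType string
}

// SubscriptionFilter narrows delivery by event type and/or actor pattern.
type SubscriptionFilter struct {
	EventTypes    []string // empty means "any type"
	ActorPatterns []string // empty means "any actor"; a trailing '*' is a prefix wildcard
}

func matchPattern(pattern, value string) bool {
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(value, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == value
}

func (f SubscriptionFilter) Matches(ev Event) bool {
	typeOK := len(f.EventTypes) == 0
	for _, t := range f.EventTypes {
		typeOK = typeOK || t == ev.EventType
	}
	actorOK := len(f.ActorPatterns) == 0
	for _, p := range f.ActorPatterns {
		actorOK = actorOK || matchPattern(p, ev.ActorID)
	}
	return typeOK && actorOK
}

func main() {
	ev := Event{Namespace: "orders", ActorID: "order-customer-123", EventType: "OrderPlaced"}
	f := SubscriptionFilter{EventTypes: []string{"OrderPlaced"}, ActorPatterns: []string{"order-customer-*"}}
	fmt.Println(f.Matches(ev))                                                          // true
	fmt.Println(SubscriptionFilter{EventTypes: []string{"OrderCancelled"}}.Matches(ev)) // false
}
```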
---

#### Capability 9: Deliver Events Across Cluster Nodes

**Bounded Context:** Event Bus (with NATS)

**Description:** Events published on one node reach subscribers on other nodes. NATS provides durability; namespace isolation is maintained across the cluster.

**Domain Support:**

- **Context:** Event Bus extended via NATSEventBus
- **Aggregate:** EventBus (extended with NATS transport)
- **Commands:** `Publish(namespace, event)` [same interface, distributed transport]
- **Events:** `EventPublished` (locally), `EventDelivered` (via NATS)
- **Invariants:**
  - Events cross cluster boundaries; subscribers on any node receive them
  - Namespace isolation is enforced even across NATS
  - Self-sourced events (from the publishing node) are not re-delivered

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/nats_eventbus.go` - NATSEventBus wraps EventBus with NATS transport
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/nats_eventbus.go` - NATS subject format: "aether.events.{namespace}"

**Business Value:**

- **Pain eliminated:** Events automatically flow across the cluster; no need to build custom message brokers
- **Job enabled:** Cross-node aggregations, sagas, and reactive workflows
- **Outcome:** NATS JetStream provides durability; events survive broker restarts
- **Beneficiary:** Distributed teams building event-driven architectures

**Success Conditions:**

1. Node-a publishes an event to the "orders" namespace
2. A subscriber on node-b (subscribed to "orders") receives the event
3. The event travels over NATS without interfering with local delivery on the publishing node
4. If node-b is offline, NATS JetStream buffers the event
5. Node-b reconnects and receives the buffered events in order
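
A minimal cross-node delivery sketch using the documented "aether.events.{namespace}" subject format over core NATS (github.com/nats-io/nats.go). Aether's NATSEventBus adds JetStream durability and self-source suppression on top of this; the code below only illustrates the subject scheme and is not Aether's API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

type Event struct {
	ActorID   string `json:"actor_id"`
	EventType string `json:"event_type"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // assumes a local NATS server
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	subject := "aether.events.orders"

	// Subscriber side (would normally run on another node).
	if _, err := nc.Subscribe(subject, func(m *nats.Msg) {
		var ev Event
		if err := json.Unmarshal(m.Data, &ev); err == nil {
			fmt.Printf("received %s for %s\n", ev.EventType, ev.ActorID)
		}
	}); err != nil {
		log.Fatal(err)
	}

	// Publisher side.
	data, _ := json.Marshal(Event{ActorID: "order-123", EventType: "OrderPlaced"})
	if err := nc.Publish(subject, data); err != nil {
		log.Fatal(err)
	}
	nc.Flush() // ensure the message is sent before exiting
}
```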
---

## Capability Groups

These capabilities work together in natural workflows:

### Event Sourcing Group

**Capabilities:**

- Capability 1: Store Events Durably with Conflict Detection
- Capability 2: Rebuild State from Event History
- Capability 3: Enable Safe Concurrent Writes

**Workflow:** Concurrent writers each get the latest version → attempt to write → detect conflicts → retry if needed → all writes land in immutable history → replay state deterministically

**Value:** Complete event history with safe concurrency enables auditable, reproducible state.

---

### Cluster Coordination Group

**Capabilities:**

- Capability 5: Coordinate Cluster Topology
- Capability 6: Distribute Actors Across Cluster Nodes
- Capability 7: Recover from Node Failures

**Workflow:** Nodes join → topology stabilizes → leader elected → shards assigned to nodes → failure detected → shards reassigned → actors replay on new nodes

**Value:** Cluster self-heals and maintains availability despite node failures.

---

### Event Distribution Group

**Capabilities:**

- Capability 4: Isolate Logical Domains Using Namespaces
- Capability 8: Route and Filter Domain Events
- Capability 9: Deliver Events Across Cluster Nodes

**Workflow:** Event published to namespace → local subscribers receive → if NATS enabled, remote subscribers receive → namespace isolation prevents cross-contamination → filters narrow delivery

**Value:** Loose coupling across cluster; namespace isolation ensures multi-tenant safety.

---

## Capability Classification

### Core Capabilities

**Why these matter:** Unique to Aether; hard to build; competitive differentiators.

- **Store Events Durably with Conflict Detection** - Core to event sourcing; requires version semantics that most systems lack
- **Rebuild State from Event History** - Enables replay and audit; not common in CRUD systems
- **Enable Safe Concurrent Writes** - Optimistic locking at domain level; avoids lock/deadlock issues
- **Coordinate Cluster Topology** - Automated leader election and failure detection; not a commodity feature
- **Distribute Actors Across Cluster Nodes** - Consistent hashing + shard mapping; built-in, not bolted-on

### Supporting Capabilities

**Why these matter:** Necessary; not unique; often bolted-on elsewhere.

- **Isolate Logical Domains Using Namespaces** - Enables multi-tenancy patterns; important but implementable in application code
- **Route and Filter Domain Events** - Standard pub/sub; Aether provides it bundled
- **Recover from Node Failures** - Expected of any distributed system; Aether automates it
- **Deliver Events Across Cluster Nodes** - Standard NATS feature; Aether integrates seamlessly

### Generic Capabilities

**Why these matter:** Commodity; consider off-the-shelf; not differentiating.

None explicitly, but potential future work:

- **Metrics and Monitoring** - Could use Prometheus exporter
- **Distributed Tracing** - Could integrate OpenTelemetry
- **Access Control** - Could add RBAC for namespace subscriptions

---

## Value Map

### Capability: Store Events Durably with Conflict Detection

- **Pain eliminated:** Race conditions corrupting state; lost writes; need for pessimistic locking
- **Job enabled:** Safe concurrent updates without locks
- **Outcome:** Immutable event history with version conflict detection
- **Beneficiary:** Go developers, platform teams
- **Priority:** Core

### Capability: Rebuild State from Event History

- **Pain eliminated:** Need for separate read models; inability to query historical state
- **Job enabled:** Temporal queries; debugging; rebuilding after corruption
- **Outcome:** State reproducible from immutable history
- **Beneficiary:** Platform operators, auditors
- **Priority:** Core

### Capability: Enable Safe Concurrent Writes

- **Pain eliminated:** Deadlocks, lock contention, pessimistic locking overhead
- **Job enabled:** High-concurrency updates (collaborative editing, distributed aggregates)
- **Outcome:** Conflicts detected immediately; application controls retry
- **Beneficiary:** Multi-user systems, distributed systems
- **Priority:** Core

### Capability: Isolate Logical Domains Using Namespaces

- **Pain eliminated:** Event leakage between tenants/contexts; complex application isolation logic
- **Job enabled:** Multi-tenant deployments without architectural complexity
- **Outcome:** Storage-level isolation enforced automatically
- **Beneficiary:** SaaS platforms, multi-bounded-context systems
- **Priority:** Supporting

### Capability: Coordinate Cluster Topology

- **Pain eliminated:** Manual cluster management; single points of failure in leader election
- **Job enabled:** Automated discovery, leader election, failure detection
- **Outcome:** Self-healing cluster with single authoritative view of state
- **Beneficiary:** Infrastructure teams, production deployments
- **Priority:** Core

### Capability: Distribute Actors Across Cluster Nodes

- **Pain eliminated:** Manual shard assignment; external shard registry
- **Job enabled:** Transparent actor routing; load balancing
- **Outcome:** Consistent hashing minimizes reshuffling on topology changes
- **Beneficiary:** Distributed system architects
- **Priority:** Core

### Capability: Recover from Node Failures

- **Pain eliminated:** Manual failover; data loss; downtime
- **Job enabled:** Cluster stays online despite node failures
- **Outcome:** Shards reassigned and actors replayed automatically
- **Beneficiary:** Production systems requiring HA
- **Priority:** Core

### Capability: Route and Filter Domain Events

- **Pain eliminated:** Tight coupling between event sources and consumers
- **Job enabled:** Loose coupling; async workflows; event-driven architecture
- **Outcome:** Events routed automatically; consumers filter independently
- **Beneficiary:** Domain-driven architects
- **Priority:** Supporting

### Capability: Deliver Events Across Cluster Nodes

- **Pain eliminated:** Building custom message brokers; NATS integration boilerplate
- **Job enabled:** Cross-node aggregations, sagas, reactive workflows
- **Outcome:** Events travel across the cluster; NATS JetStream provides durability
- **Beneficiary:** Distributed teams
- **Priority:** Supporting

---

## Success Conditions

### Capability: Store Events Durably with Conflict Detection

- **Condition:** Of two concurrent SaveEvent calls with the same version, exactly one succeeds and the other fails
- **Metric:** VersionConflictError returned in <1ms
- **Target:** 100% of conflicts detected; 0% silent failures

### Capability: Rebuild State from Event History

- **Condition:** GetEvents plus replay produces an identical state every time
- **Metric:** Replay cost proportional to the events since the latest snapshot, versus the full history without one
- **Target:** Snapshots reduce replay by >90%; no data loss during replay

### Capability: Enable Safe Concurrent Writes

- **Condition:** Two writers race; one wins, the other sees the conflict
- **Metric:** Conflict detection <1ms; the application can retry
- **Target:** No deadlocks; no pessimistic locks held

### Capability: Isolate Logical Domains Using Namespaces

- **Condition:** Events in namespace A are invisible to namespace B
- **Metric:** Storage-level isolation (separate stream names)
- **Target:** 100% isolation; no cross-namespace leakage

### Capability: Coordinate Cluster Topology

- **Condition:** Three nodes start; one is elected leader within 5 seconds
- **Metric:** Leader election time; all nodes converge on the same leader
- **Target:** Election completes within HeartbeatInterval * 2

### Capability: Distribute Actors Across Cluster Nodes

- **Condition:** An ActorID hashes to the same shard consistently; the shard maps to the same node
- **Metric:** Hash consistency; reshuffling on add/remove
- **Target:** Consistent hashing; <25% reshuffling on node change

### Capability: Recover from Node Failures

- **Condition:** Node failure detected; shards reassigned within timeout
- **Metric:** Failure detection time; rebalancing time
- **Target:** <10 seconds to detect; shards reassigned; actors online

### Capability: Route and Filter Domain Events

- **Condition:** Event published to a namespace; the exact subscriber receives it; the wildcard subscriber receives it; a filtered subscriber receives it only if the filter matches
- **Metric:** Delivery latency; filter accuracy
- **Target:** <10ms delivery; 100% filter accuracy

### Capability: Deliver Events Across Cluster Nodes

- **Condition:** Event published on node-a; a subscriber on node-b receives it
- **Metric:** Cross-node delivery latency; durability
- **Target:** <50ms cross-node delivery; NATS JetStream preserves events

---

## Dependencies Between Capabilities

```
Store Events Durably with Conflict Detection
    ↓ (enables)
Enable Safe Concurrent Writes
    ↓
Rebuild State from Event History

Coordinate Cluster Topology
    ↓ (enables)
Distribute Actors Across Cluster Nodes
    ↓ (enables)
Recover from Node Failures

Isolate Logical Domains Using Namespaces
    ↓ (enables)
Route and Filter Domain Events
    ↓ (enables)
Deliver Events Across Cluster Nodes
```

**Implementation Order:**

1. **Event Sourcing block** (capabilities 1-3): Core; enables all domain models
2. **Local Event Bus** (capability 8): Use before clustering
3. **Cluster Coordination** (capabilities 5-7): Add once Event Sourcing is solid
4. **Namespace Isolation** (capability 4): Orthogonal; add when multi-tenancy is needed
5. **NATS Event Delivery** (capability 9): Final piece; integrates all of the above

---

## Recommendations

### Build First (Value/Effort Ratio)

1. **Store Events Durably with Conflict Detection** - Foundation; everything depends on it
2. **Coordinate Cluster Topology** - Self-healing clusters are table stakes for distributed systems
3. **Distribute Actors Across Cluster Nodes** - Completes the clustering story
4. **Enable Safe Concurrent Writes** - Unlocks multi-writer use cases
5. **Route and Filter Domain Events** - Enables loose coupling

### Build Next (Expanding Use Cases)

6. **Rebuild State from Event History** - Audit and debugging; often implemented after the core
7. **Recover from Node Failures** - Completes the HA story
8. **Deliver Events Across Cluster Nodes** - NATS integration; final scale piece
9. **Isolate Logical Domains Using Namespaces** - Multi-tenancy; add when needed

### Consider Off-the-Shelf or Later

- **Metrics and Monitoring** - Use a Prometheus exporter (vendor standard)
- **Distributed Tracing** - Integrate OpenTelemetry when debugging distributed flows
- **Access Control** - Add RBAC if multi-tenancy requires fine-grained permission control

### Architecture Insights

**NATS-Native Design:**

Aether is built for JetStream from the start, not bolted onto it. This means:

- Event storage directly uses JetStream (not a wrapper around Postgres)
- Pub/sub directly uses NATS subjects (not a custom message queue)
- Cluster coordination uses NATS for discovery and messaging (not gossip or Raft)

**Implication:** If you're already using NATS, Aether requires no additional infrastructure.

**Primitives Over Frameworks:**

Aether provides:

- EventStore interface (you choose the implementation)
- EventBus interface (you choose local or NATSEventBus)
- Consistent hashing (you compose it)
- Leader election (you decide what to do with leadership)

**Implication:** You own the domain model; Aether doesn't impose one.
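
As a rough illustration of this composition style, the sketch below shows application code depending only on small, assumed interfaces; Aether's actual EventStore and EventBus interfaces may differ in shape, so treat the types here as stand-ins.

```go
package example

// Illustrative stand-ins for the primitives listed above.
type Event struct {
	ActorID string
	Version int64
	Type    string
	Data    []byte
}

type EventStore interface {
	SaveEvent(ev Event) error
	GetEvents(actorID string, fromVersion int64) ([]Event, error)
}

type EventBus interface {
	Publish(namespace string, ev Event) error
}

// OrderService is application code: it owns the domain model and simply uses
// whichever store (in-memory, JetStream) and bus (local, NATS) it is given.
type OrderService struct {
	store EventStore
	bus   EventBus
}

func (s *OrderService) PlaceOrder(orderID string, payload []byte) error {
	events, err := s.store.GetEvents(orderID, 0)
	if err != nil {
		return err
	}
	ev := Event{ActorID: orderID, Version: int64(len(events)) + 1, Type: "OrderPlaced", Data: payload}
	if err := s.store.SaveEvent(ev); err != nil {
		return err
	}
	return s.bus.Publish("orders", ev)
}
```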

**Capability-First Decomposition:**

Rather than thinking in features, think in terms of what the system can do:

- "Store events durably" (capability) enables event sourcing (architectural pattern), which in turn enables event-driven architecture (design pattern)

This prevents feature churn and focuses implementation on value.

---

## Related Documents

- **Vision** ([vision.md](./vision.md)) - Product positioning and constraints
- **CLAUDE.md** (in this repo) - Architecture patterns and version semantics
- **Organization Manifesto** - [https://git.flowmade.one/flowmade-one/architecture/src/branch/main/manifesto.md](https://git.flowmade.one/flowmade-one/architecture/src/branch/main/manifesto.md)