# Aether Product Capabilities

This document maps Aether's domain models to product capabilities, defined here as "the system's ability to cause meaningful domain changes." These capabilities bridge domain models to product value.

## Summary

Aether provides nine capabilities across five bounded contexts. They enable teams building distributed, event-sourced systems in Go to:

- Store events durably with automatic conflict detection for safe concurrent writes
- Rebuild application state from immutable event history
- Isolate logical domains using namespace boundaries without architectural complexity
- Coordinate distributed clusters with automatic leader election and shard rebalancing
- Route domain events across nodes with flexible filtering and NATS-native delivery

## Capabilities

### Core Capabilities

#### Capability 1: Store Events Durably with Conflict Detection

**Bounded Context:** Event Sourcing

**Description:** The system can persist domain events as the source of truth, preventing lost writes through monotonic version enforcement and detecting concurrent modifications before data corruption occurs.

**Domain Support:**

- **Context:** Event Sourcing
- **Aggregate:** ActorEventStream (implicit - each actor has an event stream)
- **Commands:** `SaveEvent(event)`, `GetLatestVersion(actorID)`
- **Events:** `EventStored`, `VersionConflictDetected`
- **Invariant:** Versions increase monotonically per actor; a write at or below the current version is rejected

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/event.go` - Event type, VersionConflictError, EventStore interface
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - JetStreamEventStore implements SaveEvent with version validation
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/memory.go` - InMemoryEventStore for testing

**Business Value:**

- **Pain eliminated:** Developers no longer fear concurrent writes corrupting state
- **Job enabled:** Applications can safely update actors from multiple writers (no locks, no deadlocks)
- **Outcome:** Events form an immutable, append-only history; conflicts fail fast
- **Beneficiary:** Go teams building distributed systems

**Success Conditions:**

1. Multiple writers can attempt to update the same actor simultaneously
2. At most one writer succeeds; the others receive ErrVersionConflict
3. Failed writers can inspect CurrentVersion and retry with the next version
4. No events are lost or overwritten
5. Version conflicts are detected in <1ms (optimistic locking, not pessimistic)
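
To make the invariant concrete, here is a minimal store-side sketch of the version check, assuming types shaped like the Event and VersionConflictError described above (names and fields are illustrative, not Aether's exact API):

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative types approximating the documented Event and VersionConflictError.
type Event struct {
	ActorID string
	Version int64
	Type    string
}

type VersionConflictError struct {
	ActorID          string
	AttemptedVersion int64
	CurrentVersion   int64
}

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict for %s: attempted %d, current %d",
		e.ActorID, e.AttemptedVersion, e.CurrentVersion)
}

// memoryStore enforces the invariant: a write at or below the current version is rejected.
type memoryStore struct {
	mu     sync.Mutex
	latest map[string]int64 // highest version stored per actor
	events map[string][]Event
}

func (s *memoryStore) SaveEvent(ev Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	current := s.latest[ev.ActorID]
	if ev.Version <= current {
		return &VersionConflictError{ActorID: ev.ActorID, AttemptedVersion: ev.Version, CurrentVersion: current}
	}
	s.events[ev.ActorID] = append(s.events[ev.ActorID], ev)
	s.latest[ev.ActorID] = ev.Version
	return nil
}

func main() {
	store := &memoryStore{latest: map[string]int64{}, events: map[string][]Event{}}
	fmt.Println(store.SaveEvent(Event{ActorID: "order-1", Version: 1, Type: "OrderPlaced"}))  // <nil>
	fmt.Println(store.SaveEvent(Event{ActorID: "order-1", Version: 1, Type: "OrderShipped"})) // conflict: current version is 1
}
```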
---

#### Capability 2: Rebuild State from Event History

**Bounded Context:** Event Sourcing

**Description:** The system can derive any past or present application state by replaying events from a starting version forward. Snapshots optimize replay for long-lived actors.

**Domain Support:**

- **Context:** Event Sourcing
- **Aggregate:** ActorEventStream
- **Commands:** `GetEvents(actorID, fromVersion)`, `GetLatestSnapshot(actorID)`, `SaveSnapshot(snapshot)`
- **Events:** `ReplayStarted`, `ReplayCompleted`, `SnapshotCreated`
- **Invariant:** Event history is immutable; replaying the same events always produces the same state

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/event.go` - EventStore.GetEvents, SnapshotStore interface, ActorSnapshot type
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - Implements GetEvents with optional snapshots

**Business Value:**

- **Pain eliminated:** No need for separate read models; state can be reconstructed on demand
- **Job enabled:** Debugging "how did we get here?", rebuilding state after corruption, temporal queries
- **Outcome:** Complete audit trail; state at any point in time is reproducible
- **Beneficiary:** Platform builders (Flowmade teams), consultancies auditing systems

**Success Conditions:**

1. `GetEvents(actorID, 0)` returns all events in order
2. Replaying all events produces an identical state every time
3. A snapshot reduces replay to only the events recorded after it, rather than the full history
4. Snapshots are optional; the system works without them
5. Corrupted events are reported (ReplayError) without losing clean data
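
A minimal sketch of deterministic replay with an optional snapshot, using illustrative types rather than Aether's actual Event and ActorSnapshot definitions:

```go
package main

import "fmt"

// Illustrative types only; Aether's Event and ActorSnapshot carry more fields.
type Event struct {
	ActorID string
	Version int64
	Type    string
	Amount  int
}

type Snapshot struct {
	ActorID string
	Version int64
	Balance int
}

// rebuild folds events into state, starting from a snapshot when one exists.
// Replaying the same inputs always yields the same state (determinism).
func rebuild(snap *Snapshot, events []Event) (balance int, version int64) {
	if snap != nil {
		balance, version = snap.Balance, snap.Version
	}
	for _, ev := range events {
		if ev.Version <= version {
			continue // already reflected in the snapshot
		}
		switch ev.Type {
		case "Deposited":
			balance += ev.Amount
		case "Withdrawn":
			balance -= ev.Amount
		}
		version = ev.Version
	}
	return balance, version
}

func main() {
	history := []Event{
		{ActorID: "acct-1", Version: 1, Type: "Deposited", Amount: 100},
		{ActorID: "acct-1", Version: 2, Type: "Withdrawn", Amount: 30},
		{ActorID: "acct-1", Version: 3, Type: "Deposited", Amount: 5},
	}
	snap := &Snapshot{ActorID: "acct-1", Version: 2, Balance: 70}

	full, _ := rebuild(nil, history)      // replay the full history
	fast, _ := rebuild(snap, history[2:]) // replay only events after the snapshot
	fmt.Println(full, fast)               // 75 75 - identical state either way
}
```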
---

#### Capability 3: Enable Safe Concurrent Writes

**Bounded Context:** Optimistic Concurrency Control

**Description:** Multiple concurrent writers can update the same actor without locks. Conflicts are detected immediately; the application controls the retry strategy.

**Domain Support:**

- **Context:** Optimistic Concurrency Control (enabled by the Event Sourcing capability)
- **Aggregate:** ActorEventStream
- **Commands:** `ReadVersion(actorID)`, `AttemptWrite(event)` [implicit in SaveEvent]
- **Events:** `WriteSucceeded`, `WriteFailed` (as VersionConflictError)
- **Invariant:** If two writes race, exactly one wins; the other sees the conflict

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/event.go` - VersionConflictError type with CurrentVersion details
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - Version validation in SaveEvent

**Business Value:**

- **Pain eliminated:** No need for pessimistic locking (locks, deadlocks, performance cliffs)
- **Job enabled:** High-concurrency writes (e.g., multi-user edits, distributed aggregates)
- **Outcome:** The application has visibility into conflicts and can implement backoff, circuit-breaking, or merge strategies
- **Beneficiary:** Go teams building collaborative or distributed systems

**Success Conditions:**

1. Two concurrent writers read the same current version (1) for the same actor
2. The first SaveEvent(version: 2) succeeds
3. The second SaveEvent(version: 2) receives VersionConflictError with CurrentVersion=2
4. The application can call GetLatestVersion again and retry with version 3
5. No database-level locks are held at any point (optimistic, not pessimistic)
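
A sketch of the writer-side retry loop implied by these conditions, against an assumed store contract mirroring the GetLatestVersion and SaveEvent commands above (the signatures are guesses for illustration, not Aether's API):

```go
package example

import (
	"errors"
	"fmt"
)

// Store is an illustrative contract; Aether's EventStore interface may differ.
type Store interface {
	GetLatestVersion(actorID string) (int64, error)
	SaveEvent(actorID string, version int64, eventType string, data []byte) error
}

type VersionConflictError struct{ CurrentVersion int64 }

func (e *VersionConflictError) Error() string {
	return fmt.Sprintf("version conflict: current version is %d", e.CurrentVersion)
}

// saveWithRetry implements the read-attempt-retry loop: read the latest
// version, try to append at version+1, and on conflict re-read and retry.
func saveWithRetry(s Store, actorID, eventType string, data []byte, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		version, err := s.GetLatestVersion(actorID)
		if err != nil {
			return err
		}
		err = s.SaveEvent(actorID, version+1, eventType, data)
		if err == nil {
			return nil
		}
		var conflict *VersionConflictError
		if errors.As(err, &conflict) {
			continue // another writer won the race; re-read and retry
		}
		return err // non-conflict errors are not retried
	}
	return fmt.Errorf("gave up after %d conflicting attempts", maxAttempts)
}
```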
---

#### Capability 4: Isolate Logical Domains Using Namespaces

**Bounded Context:** Namespace Isolation

**Description:** Events in one namespace are completely invisible to queries, subscriptions, and storage of another namespace. Namespaces enable logical boundaries without architectural complexity.

**Domain Support:**

- **Context:** Namespace Isolation
- **Concepts:** Namespace (value object, not aggregate)
- **Commands:** `PublishToNamespace(namespace, event)`, `SubscribeToNamespace(namespace)`, `GetEventsInNamespace(namespace, actorID)`
- **Events:** Events carry namespace context
- **Invariant:** Events stored with namespace X cannot be retrieved from namespace Y

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - JetStreamConfig.Namespace, stream name becomes "{namespace}_{streamName}"
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/eventbus.go` - Subscribe(namespacePattern), Publish(namespaceID)
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/nats_eventbus.go` - NATS subject routing with namespace isolation

**Business Value:**

- **Pain eliminated:** Multi-tenant or multi-domain systems don't need complex isolation logic in application code
- **Job enabled:** Separate bounded contexts can coexist on the same cluster without leaking events
- **Outcome:** Storage-level isolation ensures data cannot leak between namespaces
- **Beneficiary:** Platform builders (Flowmade), SaaS products using Aether

**Success Conditions:**

1. Create two stores with different namespaces: "tenant-a", "tenant-b"
2. SaveEvent to the "tenant-a" stream
3. GetEvents from "tenant-a" returns the event
4. GetEvents from "tenant-b" returns empty
5. Stream names are prefixed: "tenant-a_events", "tenant-b_events"
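
A small sketch of the naming scheme that produces this isolation, based on the "{namespace}_{streamName}" stream prefix and "aether.events.{namespace}" subject format noted in the artifacts; the Config type and function names here are illustrative:

```go
package main

import "fmt"

// Config mirrors the JetStreamConfig.Namespace behaviour described above:
// the effective stream name is "{namespace}_{streamName}".
type Config struct {
	Namespace  string
	StreamName string
}

func (c Config) effectiveStream() string {
	if c.Namespace == "" {
		return c.StreamName
	}
	return c.Namespace + "_" + c.StreamName
}

// A namespace-scoped subject keeps pub/sub isolated the same way.
func subject(namespace string) string {
	return "aether.events." + namespace
}

func main() {
	a := Config{Namespace: "tenant-a", StreamName: "events"}
	b := Config{Namespace: "tenant-b", StreamName: "events"}
	fmt.Println(a.effectiveStream(), b.effectiveStream()) // tenant-a_events tenant-b_events
	fmt.Println(subject("tenant-a"), subject("tenant-b")) // aether.events.tenant-a aether.events.tenant-b
}
```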
---

#### Capability 5: Coordinate Cluster Topology

**Bounded Context:** Cluster Coordination

**Description:** The cluster automatically discovers nodes, elects a leader, and maintains a consistent view of which nodes are alive. Failed nodes are detected and marked unavailable.

**Domain Support:**

- **Context:** Cluster Coordination
- **Aggregates:** Cluster (group of nodes), LeadershipLease (time-bound authority)
- **Commands:** `JoinCluster()`, `ElectLeader()`, `MarkNodeFailed(nodeID)`, `PublishHeartbeat()`
- **Events:** `NodeJoined`, `NodeLeft`, `LeaderElected`, `LeadershipExpired`, `NodeFailed`
- **Invariants:**
  - At most one leader at any time
  - The leader's lease expires and triggers re-election if its holder dies
  - All nodes converge on the same view of alive nodes

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/manager.go` - ClusterManager.Start() begins discovery
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/leader.go` - LeaderElection with heartbeats and re-election
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/discovery.go` - NodeDiscovery watches NATS for node announcements
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/types.go` - NodeInfo, LeadershipLease, NodeStatus

**Business Value:**

- **Pain eliminated:** No need to manually manage cluster topology or leader election
- **Job enabled:** The cluster heals itself on node failures; new nodes join automatically
- **Outcome:** Single source of truth for cluster state; the leader can coordinate rebalancing
- **Beneficiary:** Infrastructure teams (Flowmade), anyone deploying Aether on multiple nodes

**Success Conditions:**

1. Three nodes start; after heartbeats, one is elected leader
2. The leader's lease renews regularly; all nodes see a consistent LeaderID
3. The leader stops sending heartbeats; its lease expires
4. Remaining nodes elect a new leader within 2x HeartbeatInterval
5. A rejoining node detects it is behind and syncs cluster state
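
A single-process sketch of the lease rule behind these conditions: a node may claim leadership only when no unexpired lease exists. Aether's LeaderElection coordinates the equivalent over NATS; the names and policy here are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// LeadershipLease is a time-bound claim to leadership, as described above.
type LeadershipLease struct {
	LeaderID  string
	ExpiresAt time.Time
}

func (l *LeadershipLease) expired(now time.Time) bool {
	return l == nil || now.After(l.ExpiresAt)
}

// tryAcquire renews the lease if the caller already leads, or claims it if expired.
func tryAcquire(current *LeadershipLease, nodeID string, now time.Time, ttl time.Duration) (*LeadershipLease, bool) {
	if current.expired(now) || current.LeaderID == nodeID {
		return &LeadershipLease{LeaderID: nodeID, ExpiresAt: now.Add(ttl)}, true
	}
	return current, false
}

func main() {
	ttl := 2 * time.Second
	now := time.Now()

	lease, ok := tryAcquire(nil, "node-a", now, ttl)
	fmt.Println(lease.LeaderID, ok) // node-a true

	_, ok = tryAcquire(lease, "node-b", now.Add(time.Second), ttl)
	fmt.Println(ok) // false - node-a's lease is still valid

	// node-a stops renewing; after the lease expires, node-b wins re-election.
	lease, ok = tryAcquire(lease, "node-b", now.Add(3*time.Second), ttl)
	fmt.Println(lease.LeaderID, ok) // node-b true
}
```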
---

#### Capability 6: Distribute Actors Across Cluster Nodes

**Bounded Context:** Cluster Coordination

**Description:** Actors hash to shards; shards map to nodes using consistent hashing. Actor requests are routed to the shard owner. Consistent hashing keeps reshuffling minimal when the topology changes.

**Domain Support:**

- **Context:** Cluster Coordination
- **Aggregates:** ShardMap (authoritative mapping of shards to nodes), ShardAssignment
- **Commands:** `AssignShards(nodeID, shardIDs)`, `RebalanceShards(fromNode, toNode, shardIDs)`
- **Events:** `ShardAssigned`, `ShardMigrated`, `RebalanceStarted`
- **Invariants:**
  - Each shard is owned by exactly one node
  - An ActorID always hashes to the same shard
  - Consistent hashing minimizes reshuffling on node add/remove

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/hashring.go` - ConsistentHashRing implements consistent hashing
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/shard.go` - ShardManager tracks shard ownership
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/types.go` - ShardMap, NodeInfo.ShardIDs

**Business Value:**

- **Pain eliminated:** No need for an external shard registry or manual shard assignment
- **Job enabled:** Transparent actor distribution; requests route to the correct node automatically
- **Outcome:** Load spreads evenly; adding nodes doesn't require full reshuffling
- **Beneficiary:** Distributed system builders

**Success Conditions:**

1. Three nodes; 100 shards distributed evenly (33-33-34)
2. Actor "order-123" hashes to shard 42 consistently
3. Shard 42 is owned by node-b; the request is routed to node-b
4. Adding node-d rebalances ~25 shards; the rest stay put (minimal reshuffling)
5. Removing node-a redistributes its shards among the remaining nodes
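
A sketch of the two lookups involved in routing, actor to shard and shard to node. Aether's ConsistentHashRing uses a proper ring with virtual nodes; the modulo mapping below is only illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 100

// shardFor hashes an actor ID to a stable shard number.
func shardFor(actorID string) int {
	h := fnv.New32a()
	h.Write([]byte(actorID))
	return int(h.Sum32() % numShards)
}

func main() {
	// shardOwners would normally come from the leader-maintained ShardMap.
	shardOwners := map[int]string{}
	nodes := []string{"node-a", "node-b", "node-c"}
	for shard := 0; shard < numShards; shard++ {
		shardOwners[shard] = nodes[shard%len(nodes)]
	}

	shard := shardFor("order-123")
	fmt.Printf("actor order-123 -> shard %d -> %s\n", shard, shardOwners[shard])

	// The same actor ID always maps to the same shard.
	fmt.Println(shardFor("order-123") == shard) // true
}
```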
---

#### Capability 7: Recover from Node Failures

**Bounded Context:** Cluster Coordination

**Description:** When a node fails, its shards are automatically reassigned to healthy nodes. Actors replay from JetStream on the new node. The cluster remains available.

**Domain Support:**

- **Context:** Cluster Coordination
- **Aggregates:** Cluster, ShardAssignment
- **Commands:** `MarkNodeFailed(nodeID)`, `RebalanceShards(failedNode)`
- **Events:** `NodeFailed`, `ShardMigrated`, `ActorReplayed`
- **Invariants:**
  - A failed node's shards are claimed by healthy nodes within FailureDetectionTimeout
  - No actor is lost from the cluster permanently

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/manager.go` - monitorNodes() detects heartbeat failures
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/cluster/leader.go` - Leader initiates rebalancing
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/store/jetstream.go` - Actors replay state from JetStream on the new node

**Business Value:**

- **Pain eliminated:** No manual intervention needed to recover from node failure
- **Job enabled:** The cluster stays online despite single-node failures
- **Outcome:** Recovery time (RTO) is bounded by rebalancing plus replay time
- **Beneficiary:** Production systems requiring high availability

**Success Conditions:**

1. Node-a holds shards [1, 2, 3], then dies
2. The leader detects the failure (heartbeat timeout)
3. Shards [1, 2, 3] are reassigned to healthy nodes within 5 seconds
4. Actors in those shards replay from JetStream on their new homes
5. New requests reach the actors on their new nodes; no data is lost
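
A sketch of the rebalancing step the leader performs when a node fails; the round-robin reassignment policy is an assumption for illustration, not necessarily Aether's strategy.

```go
package main

import "fmt"

// reassign moves every shard owned by the failed node to one of the
// remaining healthy nodes.
func reassign(owners map[int]string, failed string, healthy []string) {
	i := 0
	for shard, node := range owners {
		if node != failed {
			continue
		}
		owners[shard] = healthy[i%len(healthy)]
		i++
	}
}

func main() {
	owners := map[int]string{1: "node-a", 2: "node-a", 3: "node-a", 4: "node-b", 5: "node-c"}
	reassign(owners, "node-a", []string{"node-b", "node-c"})
	fmt.Println(owners) // shards 1-3 now owned by node-b/node-c; 4 and 5 unchanged
}
```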
---

#### Capability 8: Route and Filter Domain Events

**Bounded Context:** Event Bus

**Description:** Events published to a namespace are delivered to all subscribers of that namespace (or matching patterns). Subscribers can filter by event type or actor pattern.

**Domain Support:**

- **Context:** Event Bus
- **Aggregate:** EventBus (local pub/sub coordination)
- **Commands:** `Publish(namespace, event)`, `Subscribe(namespacePattern)`, `SubscribeWithFilter(namespacePattern, filter)`
- **Events:** `EventPublished`, `SubscriptionCreated`
- **Invariants:**
  - All subscribers of a namespace receive all events (before filters)
  - Filters are applied client-side; subscribers get only matching events
  - Exact subscriptions are isolated from wildcard subscriptions

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/eventbus.go` - EventBus.Publish, Subscribe, SubscribeWithFilter
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/pattern.go` - SubscriptionFilter, namespace pattern matching

**Business Value:**

- **Pain eliminated:** No need to manually route events; pub/sub happens automatically
- **Job enabled:** Loose coupling; event producers don't know about consumers
- **Outcome:** New subscribers can join/leave without touching publishers
- **Beneficiary:** Domain-driven architects building loosely coupled systems

**Success Conditions:**

1. Publish event to "orders" namespace
2. Exact subscriber of "orders" receives event
3. Wildcard subscriber of "order*" receives event
4. Subscriber with filter `{EventTypes: ["OrderPlaced"]}` receives event only if EventType="OrderPlaced"
5. Subscriber with actor pattern "order-customer-123" receives event only if ActorID matches
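
A sketch of client-side filtering consistent with these conditions; the SubscriptionFilter fields and wildcard rule are assumptions modeled on the description above, not Aether's pattern.go implementation.

```go
package main

import (
	"fmt"
	"strings"
)

type Event struct {
	Namespace string
	ActorID   string
	EventType string
}

// SubscriptionFilter narrows delivery by event type and/or actor pattern.
type SubscriptionFilter struct {
	EventTypes    []string // empty means "any type"
	ActorPatterns []string // empty means "any actor"; a trailing '*' is a prefix wildcard
}

func matchPattern(pattern, value string) bool {
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(value, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == value
}

func (f SubscriptionFilter) Matches(ev Event) bool {
	typeOK := len(f.EventTypes) == 0
	for _, t := range f.EventTypes {
		typeOK = typeOK || t == ev.EventType
	}
	actorOK := len(f.ActorPatterns) == 0
	for _, p := range f.ActorPatterns {
		actorOK = actorOK || matchPattern(p, ev.ActorID)
	}
	return typeOK && actorOK
}

func main() {
	ev := Event{Namespace: "orders", ActorID: "order-customer-123", EventType: "OrderPlaced"}
	f := SubscriptionFilter{EventTypes: []string{"OrderPlaced"}, ActorPatterns: []string{"order-customer-*"}}
	fmt.Println(f.Matches(ev))                                                          // true
	fmt.Println(SubscriptionFilter{EventTypes: []string{"OrderCancelled"}}.Matches(ev)) // false
}
```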
---

#### Capability 9: Deliver Events Across Cluster Nodes

**Bounded Context:** Event Bus (with NATS)

**Description:** Events published on one node reach subscribers on other nodes. NATS provides durability; namespace isolation is maintained across the cluster.

**Domain Support:**

- **Context:** Event Bus extended via NATSEventBus
- **Aggregate:** EventBus (extended with NATS transport)
- **Commands:** `Publish(namespace, event)` [same interface, distributed transport]
- **Events:** `EventPublished` (locally), `EventDelivered` (via NATS)
- **Invariants:**
  - Events cross cluster boundaries; subscribers on any node receive them
  - Namespace isolation is enforced even across NATS
  - Self-sourced events (from the publishing node) are not re-delivered

**Artifacts:**

- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/nats_eventbus.go` - NATSEventBus wraps EventBus with NATS transport
- `/Users/hugo.nijhuis/src/github/flowmade-one/aether/nats_eventbus.go` - NATS subject format: "aether.events.{namespace}"

**Business Value:**

- **Pain eliminated:** Events automatically flow across the cluster; no need to build custom message brokers
- **Job enabled:** Cross-node aggregations, sagas, and reactive workflows
- **Outcome:** NATS JetStream provides durability; events survive broker restarts
- **Beneficiary:** Distributed teams building event-driven architectures

**Success Conditions:**

1. Node-a publishes an event to the "orders" namespace
2. A subscriber on node-b (subscribed to "orders") receives the event
3. The event travels over NATS without interfering with local delivery on the publishing node
4. If node-b is offline, NATS JetStream buffers the event
5. Node-b reconnects and receives the buffered events in order
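
A minimal cross-node delivery sketch using the documented "aether.events.{namespace}" subject format over core NATS (github.com/nats-io/nats.go). Aether's NATSEventBus adds JetStream durability and self-source suppression on top of this; the code below only illustrates the subject scheme and is not Aether's API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

type Event struct {
	ActorID   string `json:"actor_id"`
	EventType string `json:"event_type"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // assumes a local NATS server
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	subject := "aether.events.orders"

	// Subscriber side (would normally run on another node).
	if _, err := nc.Subscribe(subject, func(m *nats.Msg) {
		var ev Event
		if err := json.Unmarshal(m.Data, &ev); err == nil {
			fmt.Printf("received %s for %s\n", ev.EventType, ev.ActorID)
		}
	}); err != nil {
		log.Fatal(err)
	}

	// Publisher side.
	data, _ := json.Marshal(Event{ActorID: "order-123", EventType: "OrderPlaced"})
	if err := nc.Publish(subject, data); err != nil {
		log.Fatal(err)
	}
	nc.Flush() // ensure the message is sent before exiting
}
```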
---

## Capability Groups

These capabilities work together in natural workflows:

### Event Sourcing Group

**Capabilities:**

- Capability 1: Store Events Durably with Conflict Detection
- Capability 2: Rebuild State from Event History
- Capability 3: Enable Safe Concurrent Writes

**Workflow:** Concurrent writers each get the latest version → attempt to write → detect conflicts → retry if needed → all writes land in immutable history → replay state deterministically

**Value:** Complete event history with safe concurrency enables auditable, reproducible state.

---

### Cluster Coordination Group

**Capabilities:**

- Capability 5: Coordinate Cluster Topology
- Capability 6: Distribute Actors Across Cluster Nodes
- Capability 7: Recover from Node Failures

**Workflow:** Nodes join → topology stabilizes → leader elected → shards assigned to nodes → failure detected → shards reassigned → actors replay on new nodes

**Value:** Cluster self-heals and maintains availability despite node failures.

---

### Event Distribution Group

**Capabilities:**

- Capability 4: Isolate Logical Domains Using Namespaces
- Capability 8: Route and Filter Domain Events
- Capability 9: Deliver Events Across Cluster Nodes

**Workflow:** Event published to namespace → local subscribers receive → if NATS enabled, remote subscribers receive → namespace isolation prevents cross-contamination → filters narrow delivery

**Value:** Loose coupling across cluster; namespace isolation ensures multi-tenant safety.

---

## Capability Classification

### Core Capabilities

**Why these matter:** Unique to Aether; hard to build; competitive differentiators.

- **Store Events Durably with Conflict Detection** - Core to event sourcing; requires version semantics that most systems lack
- **Rebuild State from Event History** - Enables replay and audit; not common in CRUD systems
- **Enable Safe Concurrent Writes** - Optimistic locking at domain level; avoids lock/deadlock issues
- **Coordinate Cluster Topology** - Automated leader election and failure detection; not a commodity feature
- **Distribute Actors Across Cluster Nodes** - Consistent hashing + shard mapping; built-in, not bolted-on

### Supporting Capabilities

**Why these matter:** Necessary; not unique; often bolted-on elsewhere.

- **Isolate Logical Domains Using Namespaces** - Enables multi-tenancy patterns; important but implementable in application code
- **Route and Filter Domain Events** - Standard pub/sub; Aether provides it bundled
- **Recover from Node Failures** - Expected of any distributed system; Aether automates it
- **Deliver Events Across Cluster Nodes** - Standard NATS feature; Aether integrates seamlessly

### Generic Capabilities

**Why these matter:** Commodity; consider off-the-shelf; not differentiating.

None explicitly, but potential future work:

- **Metrics and Monitoring** - Could use Prometheus exporter
- **Distributed Tracing** - Could integrate OpenTelemetry
- **Access Control** - Could add RBAC for namespace subscriptions

---

## Value Map

### Capability: Store Events Durably with Conflict Detection

- **Pain eliminated:** Race conditions corrupting state; lost writes; need for pessimistic locking
- **Job enabled:** Safe concurrent updates without locks
- **Outcome:** Immutable event history with version conflict detection
- **Beneficiary:** Go developers, platform teams
- **Priority:** Core

### Capability: Rebuild State from Event History

- **Pain eliminated:** Need for separate read models; inability to query historical state
- **Job enabled:** Temporal queries; debugging; rebuilding after corruption
- **Outcome:** State reproducible from immutable history
- **Beneficiary:** Platform operators, auditors
- **Priority:** Core

### Capability: Enable Safe Concurrent Writes

- **Pain eliminated:** Deadlocks, lock contention, pessimistic locking overhead
- **Job enabled:** High-concurrency updates (collaborative editing, distributed aggregates)
- **Outcome:** Conflicts detected immediately; application controls retry
- **Beneficiary:** Multi-user systems, distributed systems
- **Priority:** Core

### Capability: Isolate Logical Domains Using Namespaces

- **Pain eliminated:** Event leakage between tenants/contexts; complex application isolation logic
- **Job enabled:** Multi-tenant deployments without architectural complexity
- **Outcome:** Storage-level isolation enforced automatically
- **Beneficiary:** SaaS platforms, multi-bounded-context systems
- **Priority:** Supporting

### Capability: Coordinate Cluster Topology

- **Pain eliminated:** Manual cluster management; single points of failure in leader election
- **Job enabled:** Automated discovery, leader election, failure detection
- **Outcome:** Self-healing cluster with single authoritative view of state
- **Beneficiary:** Infrastructure teams, production deployments
- **Priority:** Core

### Capability: Distribute Actors Across Cluster Nodes

- **Pain eliminated:** Manual shard assignment; external shard registry
- **Job enabled:** Transparent actor routing; load balancing
- **Outcome:** Consistent hashing minimizes reshuffling on topology changes
- **Beneficiary:** Distributed system architects
- **Priority:** Core

### Capability: Recover from Node Failures

- **Pain eliminated:** Manual failover; data loss; downtime
- **Job enabled:** Cluster stays online despite node failures
- **Outcome:** Shards reassigned and actors replayed automatically
- **Beneficiary:** Production systems requiring HA
- **Priority:** Core

### Capability: Route and Filter Domain Events

- **Pain eliminated:** Tight coupling between event sources and consumers
- **Job enabled:** Loose coupling; async workflows; event-driven architecture
- **Outcome:** Events routed automatically; consumers filter independently
- **Beneficiary:** Domain-driven architects
- **Priority:** Supporting

### Capability: Deliver Events Across Cluster Nodes

- **Pain eliminated:** Building custom message brokers; NATS integration boilerplate
- **Job enabled:** Cross-node aggregations, sagas, reactive workflows
- **Outcome:** Events travel across the cluster; NATS JetStream provides durability
- **Beneficiary:** Distributed teams
- **Priority:** Supporting

---

## Success Conditions

### Capability: Store Events Durably with Conflict Detection

- **Condition:** Of two concurrent SaveEvent calls with the same version, exactly one succeeds and the other fails
- **Metric:** VersionConflictError returned in <1ms
- **Target:** 100% of conflicts detected; 0% silent failures

### Capability: Rebuild State from Event History

- **Condition:** GetEvents plus replay produces an identical state every time
- **Metric:** Replay cost proportional to the events since the latest snapshot, versus the full history without one
- **Target:** Snapshots reduce replay by >90%; no data loss during replay

### Capability: Enable Safe Concurrent Writes

- **Condition:** Two writers race; one wins, the other sees the conflict
- **Metric:** Conflict detection <1ms; the application can retry
- **Target:** No deadlocks; no pessimistic locks held

### Capability: Isolate Logical Domains Using Namespaces

- **Condition:** Events in namespace A are invisible to namespace B
- **Metric:** Storage-level isolation (separate stream names)
- **Target:** 100% isolation; no cross-namespace leakage

### Capability: Coordinate Cluster Topology

- **Condition:** Three nodes start; one is elected leader within 5 seconds
- **Metric:** Leader election time; all nodes converge on the same leader
- **Target:** Election completes within HeartbeatInterval * 2

### Capability: Distribute Actors Across Cluster Nodes

- **Condition:** An ActorID hashes to the same shard consistently; the shard maps to the same node
- **Metric:** Hash consistency; reshuffling on add/remove
- **Target:** Consistent hashing; <25% reshuffling on node change

### Capability: Recover from Node Failures

- **Condition:** Node failure detected; shards reassigned within timeout
- **Metric:** Failure detection time; rebalancing time
- **Target:** <10 seconds to detect; shards reassigned; actors online

### Capability: Route and Filter Domain Events

- **Condition:** Event published to a namespace; the exact subscriber receives it; the wildcard subscriber receives it; a filtered subscriber receives it only if the filter matches
- **Metric:** Delivery latency; filter accuracy
- **Target:** <10ms delivery; 100% filter accuracy

### Capability: Deliver Events Across Cluster Nodes

- **Condition:** Event published on node-a; a subscriber on node-b receives it
- **Metric:** Cross-node delivery latency; durability
- **Target:** <50ms cross-node delivery; NATS JetStream preserves events

---

## Dependencies Between Capabilities

```
Store Events Durably with Conflict Detection
    ↓ (enables)
Enable Safe Concurrent Writes
    ↓
Rebuild State from Event History

Coordinate Cluster Topology
    ↓ (enables)
Distribute Actors Across Cluster Nodes
    ↓ (enables)
Recover from Node Failures

Isolate Logical Domains Using Namespaces
    ↓ (enables)
Route and Filter Domain Events
    ↓ (enables)
Deliver Events Across Cluster Nodes
```

**Implementation Order:**

1. **Event Sourcing block** (capabilities 1-3): Core; enables all domain models
2. **Local Event Bus** (capability 8): Use before clustering
3. **Cluster Coordination** (capabilities 5-7): Add once Event Sourcing is solid
4. **Namespace Isolation** (capability 4): Orthogonal; add when multi-tenancy is needed
5. **NATS Event Delivery** (capability 9): Final piece; integrates all of the above

---

## Recommendations

### Build First (Value/Effort Ratio)

1. **Store Events Durably with Conflict Detection** - Foundation; everything depends on it
2. **Coordinate Cluster Topology** - Self-healing clusters are table stakes for distributed systems
3. **Distribute Actors Across Cluster Nodes** - Completes the clustering story
4. **Enable Safe Concurrent Writes** - Unlocks multi-writer use cases
5. **Route and Filter Domain Events** - Enables loose coupling

### Build Next (Expanding Use Cases)

6. **Rebuild State from Event History** - Audit and debugging; often implemented after the core
7. **Recover from Node Failures** - Completes the HA story
8. **Deliver Events Across Cluster Nodes** - NATS integration; final scale piece
9. **Isolate Logical Domains Using Namespaces** - Multi-tenancy; add when needed

### Consider Off-the-Shelf or Later

- **Metrics and Monitoring** - Use a Prometheus exporter (vendor standard)
- **Distributed Tracing** - Integrate OpenTelemetry when debugging distributed flows
- **Access Control** - Add RBAC if multi-tenancy requires fine-grained permission control

### Architecture Insights

**NATS-Native Design:**

Aether is built for JetStream from the start, not bolted onto it. This means:

- Event storage directly uses JetStream (not a wrapper around Postgres)
- Pub/sub directly uses NATS subjects (not a custom message queue)
- Cluster coordination uses NATS for discovery and messaging (not gossip or Raft)

**Implication:** If you're already using NATS, Aether requires no additional infrastructure.

**Primitives Over Frameworks:**

Aether provides:

- EventStore interface (you choose the implementation)
- EventBus interface (you choose local or NATSEventBus)
- Consistent hashing (you compose it)
- Leader election (you decide what to do with leadership)

**Implication:** You own the domain model; Aether doesn't impose one.
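
As a rough illustration of this composition style, the sketch below shows application code depending only on small, assumed interfaces; Aether's actual EventStore and EventBus interfaces may differ in shape, so treat the types here as stand-ins.

```go
package example

// Illustrative stand-ins for the primitives listed above.
type Event struct {
	ActorID string
	Version int64
	Type    string
	Data    []byte
}

type EventStore interface {
	SaveEvent(ev Event) error
	GetEvents(actorID string, fromVersion int64) ([]Event, error)
}

type EventBus interface {
	Publish(namespace string, ev Event) error
}

// OrderService is application code: it owns the domain model and simply uses
// whichever store (in-memory, JetStream) and bus (local, NATS) it is given.
type OrderService struct {
	store EventStore
	bus   EventBus
}

func (s *OrderService) PlaceOrder(orderID string, payload []byte) error {
	events, err := s.store.GetEvents(orderID, 0)
	if err != nil {
		return err
	}
	ev := Event{ActorID: orderID, Version: int64(len(events)) + 1, Type: "OrderPlaced", Data: payload}
	if err := s.store.SaveEvent(ev); err != nil {
		return err
	}
	return s.bus.Publish("orders", ev)
}
```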

**Capability-First Decomposition:**

Rather than thinking in features, think in terms of what the system can do:

- "Store events durably" (capability) enables event sourcing (architectural pattern), which in turn enables event-driven architecture (design pattern)

This prevents feature churn and focuses implementation on value.

---

## Related Documents

- **Vision** ([vision.md](./vision.md)) - Product positioning and constraints
- **CLAUDE.md** (in this repo) - Architecture patterns and version semantics
- **Organization Manifesto** - [https://git.flowmade.one/flowmade-one/architecture/src/branch/main/manifesto.md](https://git.flowmade.one/flowmade-one/architecture/src/branch/main/manifesto.md)