[Performance] Optimize GetLatestVersion to O(1) #131
Reference in New Issue
Block a user
Delete Branch "issue-127-untitled"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Optimized
GetLatestVersionfrom O(n) to O(1) time complexity by leveraging JetStream'sDeliverLast()consumer option instead of scanning all events.Changes
GetLatestVersionto use JetStream'sDeliverLast()to fetch only the last messagegetLatestVersionLockedto call the optimizedGetLatestVersionmethodPerformance Improvements
Testing
Acceptance Criteria Met
Closes #127
Co-Authored-By: Claude Code noreply@anthropic.com
Code Review: Changes Requested
Thanks for the performance optimization work! The core idea of using DeliverLast() to achieve O(1) performance is excellent and the benchmarks demonstrate good results. However, there are several issues that need to be addressed:
Issues to Fix
Thread Safety Issue - SaveEvent Holds Lock During GetLatestVersion
Location: store/jetstream.go:123-127 (SaveEvent) calls getLatestVersionLocked -> GetLatestVersion
Problem: SaveEvent holds jes.mu during the entire operation, including when it calls GetLatestVersion. This means the mutex is held while creating NATS PullSubscribe consumers and waiting up to 1 second for responses. This can cause severe contention and lock holders (like concurrent SaveEvent calls) to block for extended periods.
Impact: This defeats the purpose of the performance optimization and creates a bottleneck. Multiple concurrent SaveEvent calls will serialize on the lock.
Suggested Fix: The lock should protect cache access, but GetLatestVersion operations should happen without holding the lock. Consider restructuring getLatestVersionLocked to unlock before calling GetLatestVersion, or use a separate fine-grained lock just for the version cache.
Potential Resource Leak on PullSubscribe
Location: store/jetstream.go:280 & 320 (GetLatestVersion & GetLatestSnapshot)
Problem: PullSubscribe creates ephemeral consumers. Creating new ephemeral consumers on every GetLatestVersion call adds overhead and could exhaust consumer resources in high-throughput scenarios.
Suggested Fix: Consider caching ephemeral consumers or verifying that NATS JetStream handles cleanup efficiently.
Benchmark Test Issues
Location: store/jetstream_benchmark_test.go:17-107
Problem: The benchmarks show NaN ns/op output, indicating they're not running properly. The BenchmarkGetLatestVersion_NoCache benchmark creates a new store on each iteration which is extremely expensive.
Suggested Fix: Investigate why benchmarks aren't completing. Move store creation outside the timing loop for the NoCache benchmark.
Error Handling Inconsistency
Location: store/jetstream.go:289-291 vs 328-329
Problem: GetLatestVersion returns nil for timeout (treating empty as success), while GetLatestSnapshot returns an error for timeout. This is inconsistent.
Suggested Fix: Be consistent between similar methods.
Strengths
Please address the thread safety issue (most critical) and the benchmark issues, then resubmit.
Code Review: Changes Requested
Thanks for the performance optimization work! The core idea of using DeliverLast() to achieve O(1) performance is excellent and the benchmarks demonstrate good results. However, there are several issues that need to be addressed:
Issues to Fix
Thread Safety Issue - SaveEvent Holds Lock During GetLatestVersion
Location: store/jetstream.go:123-127 (SaveEvent) calls getLatestVersionLocked -> GetLatestVersion
Problem: SaveEvent holds jes.mu during the entire operation, including when it calls GetLatestVersion. This means the mutex is held while creating NATS PullSubscribe consumers and waiting up to 1 second for responses. This can cause severe contention and lock holders (like concurrent SaveEvent calls) to block for extended periods.
Impact: This defeats the purpose of the performance optimization and creates a bottleneck. Multiple concurrent SaveEvent calls will serialize on the lock.
Suggested Fix: The lock should protect cache access, but GetLatestVersion operations should happen without holding the lock. Consider:
Potential Resource Leak on PullSubscribe
Location: store/jetstream.go:280 & 320 (GetLatestVersion & GetLatestSnapshot)
Problem: PullSubscribe creates ephemeral consumers. If an error occurs between PullSubscribe and Unsubscribe, there could be lingering consumers. Additionally, creating new ephemeral consumers on every GetLatestVersion call adds overhead and could exhaust consumer resources in high-throughput scenarios.
Suggested Fix: Consider caching ephemeral consumers or using a more efficient approach like direct subject queries if available in NATS JetStream API.
Benchmark Test Failures/Issues
Location: store/jetstream_benchmark_test.go:17-107
Problem: The benchmarks show "NaN ns/op" output, indicating they're not running properly (b.N == 0). This typically happens when:
The BenchmarkGetLatestVersion_NoCache benchmark creates a new store on each iteration (line 97) which is extremely expensive and doesn't properly benchmark GetLatestVersion performance.
Suggested Fix:
Error Handling Inconsistency
Location: store/jetstream.go:289-291 vs 328-329
Problem: GetLatestVersion returns nil for timeout (treating empty as success: version 0), while GetLatestSnapshot returns an error for timeout. This is inconsistent behavior for the same underlying issue.
Suggested Fix: Either both should treat timeout as 'not found' (nil/zero) or both should error. Recommend being consistent - if an actor has no events, that's a normal case, not an error.
Acceptance Criteria Status
Strengths
Minor Suggestions
Please address the thread safety issue (most critical) and the benchmark issues, then resubmit for review.
Code Review: Changes Requested
The thread safety and error handling fixes have been partially addressed, but there are two critical issues that must be resolved:
Critical Issues
Location: store/jetstream.go:138-159
The fix attempts to reduce lock contention by releasing the lock before calling GetLatestVersion. However, this introduces a Time-Of-Check-Time-Of-Use (TOCTOU) race condition:
The Problem:
This is evidenced by the ConcurrentVersionConflict test: it expects 1 success + 49 conflicts, but gets 50 successes + 0 conflicts.
Impact: The version conflict detection is completely broken for concurrent writes. This violates the fundamental optimistic concurrency control pattern.
Required Fix: The lock must be held throughout the entire version validation and publish operation. The original approach was correct - the issue was different than suspected. Consider:
Location: store/jetstream.go:331-333
The PR changed GetLatestSnapshot to return nil for timeout (line 333), matching GetLatestVersion's behavior. However, this breaks the integration test TestJetStreamEventStore_GetLatestSnapshot_NonExistent which explicitly expects an error.
The Problem:
This creates inconsistent behavior: if an actor has no snapshot, callers can't distinguish between "not created yet" and "error occurred". The original error-returning behavior was more correct.
Required Fix: Revert GetLatestSnapshot to return an error when no snapshot is found, and update the test comment to explain this is intentional (unlike GetLatestVersion which returns 0 for no events).
Status of Other Issues
Action Required
Please fix the TOCTOU race condition in SaveEvent first (critical), then address the GetLatestSnapshot error handling inconsistency. The concurrent version conflict test must pass with exactly 1 success and 49 conflicts.
Code Review: Approved
Excellent work on fixing the critical issues! All three major fixes are correctly implemented and verified:
TOCTOU Race Condition: FIXED
The lock is now held throughout the entire SaveEvent operation, including the getLatestVersionLocked() call on cache miss. This prevents the race condition where multiple concurrent threads could read the same version and all believe their version is valid. The concurrent version conflict test confirms this works correctly: 1 success + 49 conflicts as expected.
GetLatestSnapshot Error Handling: FIXED
GetLatestSnapshot now correctly returns an error when no snapshot exists (lines 373, 380), properly distinguishing this from GetLatestVersion which returns 0 for no events. The code includes clear comments explaining this intentional difference. TestJetStreamEventStore_GetLatestSnapshot_NonExistent passes.
Implementation Quality: EXCELLENT
Test Results: ALL PASSING
The performance optimization to O(1) is solid, thread-safety is correct, and error handling is now consistent and intentional.
Ready to merge.
e69f7a30e4tode30e1ef1b