
feat(cache): add object versioning and standalone cache binary (RFE426) #427

Open
k82cn wants to merge 17 commits into xflops:main from k82cn:flm_426

k82cn commented Apr 29, 2026

Summary

This PR implements RFE426: Object Versioning for Client-Side Cache, with two major components:

  1. Standalone flame-object-cache binary - Extract object cache from executor-manager into dedicated process
  2. Object versioning & client-side caching - Version tracking with conditional GET for efficient caching

Changes

Server-Side (Rust)

Standalone Binary:

  • Convert object_cache crate from library to binary (remove lib.rs, add main.rs)
  • Remove embedded cache startup from executor_manager/src/main.rs
  • Add Dockerfile.foc for container builds
  • Add service to compose.yaml with depends_on for proper ordering

Version Tracking:

  • Initialize object version to 1 on PUT operation
  • Increment version on PATCH operation
  • Implement conditional GET via do_get ticket format: {key}:{client_version}
  • Return empty Arrow Flight stream when client version matches (not modified)
  • Return full object as RecordBatch when versions differ
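
The version rules above can be modeled with a small in-memory store. This is a minimal Python sketch, not the Rust implementation: `VersionedStore` and `VersionedObject` are hypothetical names, deltas are simply concatenated onto the data, and `None` stands in for the "not modified" response. Overwrites increment the version, following the follow-up commit later in this thread.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VersionedObject:
    version: int
    data: bytes


class VersionedStore:
    """Toy stand-in for the cache server (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, VersionedObject] = {}

    def put(self, key: str, data: bytes) -> int:
        # New objects start at version 1; overwrites keep the version monotonic.
        previous = self._objects.get(key)
        version = previous.version + 1 if previous else 1
        self._objects[key] = VersionedObject(version, data)
        return version

    def patch(self, key: str, delta: bytes) -> int:
        # PATCH appends a delta and increments the version.
        obj = self._objects[key]
        obj.data += delta
        obj.version += 1
        return obj.version

    def get(self, key: str, client_version: int) -> Optional[VersionedObject]:
        # None models the "not modified" response when versions match.
        obj = self._objects[key]
        if client_version == obj.version:
            return None
        return obj
```

Because stored versions start at 1, a client version of 0 never matches, so it always yields the full object.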

ObjectKey Enhancements:

  • Add wildcard session support (app/*) for bulk delete operations
  • Refactor key parsing to use ObjectKey struct throughout
  • Add is_all_sessions(), matches(), with_object_id() methods
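
To illustrate how a wildcard session key might select objects for bulk delete, here is a Python model of the `ObjectKey` idea. The method names come from the list above, but the `app/session/object_id` layout, the `parse` helper, and the matching semantics are assumptions, not the Rust struct's actual behavior.

```python
from dataclasses import dataclass, replace
from typing import Optional

WILDCARD_SESSION = "*"


@dataclass(frozen=True)
class ObjectKey:
    app: str
    session: str
    object_id: Optional[str] = None

    @classmethod
    def parse(cls, key: str) -> "ObjectKey":
        # Assumed layout: "app/session" (prefix) or "app/session/object_id".
        parts = key.split("/")
        if len(parts) not in (2, 3) or not all(parts):
            raise ValueError(f"invalid object key: {key!r}")
        return cls(*parts)

    def is_all_sessions(self) -> bool:
        return self.session == WILDCARD_SESSION

    def matches(self, other: "ObjectKey") -> bool:
        """True if `other` falls under this key when it is used as a prefix."""
        if self.app != other.app:
            return False
        if not self.is_all_sessions() and self.session != other.session:
            return False
        return self.object_id is None or self.object_id == other.object_id

    def with_object_id(self, object_id: str) -> "ObjectKey":
        return replace(self, object_id=object_id)
```

Under these assumptions, `ObjectKey.parse("app/*")` matches every object of `app` regardless of session, while `ObjectKey.parse("app/s1")` matches only objects in session `s1`.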

Client-Side (Python SDK)

Client-Side Cache:

  • Add _object_cache: Dict[tuple, Object] with thread-safe _cache_lock
  • get_object() always checks with server, returns cached data on version match
  • version=0 bypasses cache (unconditional fetch)
  • update_object() and patch_object() invalidate cache after mutation
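
The client-side flow described above could look roughly like the following. It is a simplified sketch rather than the SDK code: `CachingClient`, `invalidate`, and the injected `fetch` callable are hypothetical, the cache is keyed by a plain string instead of a tuple, and `fetch` returns `None` to model the "not modified" reply.

```python
import threading
from typing import Callable, Dict, Optional, Tuple

# fetch(key, client_version) -> None when not modified, else (version, data).
Fetch = Callable[[str, int], Optional[Tuple[int, bytes]]]


class CachingClient:
    def __init__(self, fetch: Fetch) -> None:
        self._fetch = fetch
        self._object_cache: Dict[str, Tuple[int, bytes]] = {}
        self._cache_lock = threading.Lock()

    def get_object(self, key: str, version: Optional[int] = None) -> bytes:
        with self._cache_lock:
            cached = self._object_cache.get(key)
        # version=0 forces an unconditional fetch; otherwise send the cached version.
        client_version = 0 if version == 0 or cached is None else cached[0]
        # The server is always consulted, even when a cached copy exists.
        result = self._fetch(key, client_version)
        if result is None:
            if cached is None:
                raise LookupError(f"server reported not modified for uncached {key!r}")
            return cached[1]
        with self._cache_lock:
            self._object_cache[key] = result
        return result[1]

    def invalidate(self, key: str) -> None:
        # Called after update/patch so the next read refetches the object.
        with self._cache_lock:
            self._object_cache.pop(key, None)
```

The lock is held only around dictionary access, not around the network call, so concurrent readers of different keys do not serialize on the fetch.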

New Functions:

  • delete_objects(key_prefix) - Delete objects by prefix with wildcard support

Deployment

flmadm:

  • Add InstallProfile::Cache and --cache flag
  • Add flame-object-cache to BuildArtifacts
  • Add systemd service template with MemoryMax=12G
  • Include cache in --all profile

CI/Docker:

  • Update ci/flame-cluster*.yaml cache endpoints to flame-object-cache:9090
  • Add flame-object-cache service to compose.yaml
  • Executor-manager now depends_on object-cache

Documentation

  • Add docs/designs/RFE426-cache-versioning/FS.md - Full design document
  • Add docs/designs/RFE426-cache-versioning/STATUS.md - Implementation checklist
  • Update docs/designs/RFE318-cache/FS.md - Add ObjectKey and wildcard docs
  • Update object_cache/README.md - Wire protocol details, accurate API table
  • Update flmadm/README.md - Cache installation instructions

Wire Protocol

| Operation | Description |
| --- | --- |
| do_put | Upload/update object (returns ObjectRef with version) |
| do_put (PATCH cmd) | Append delta (FlightDescriptor.for_command("PATCH:{key}")) |
| do_get | Conditional get (ticket: {key}:{version}, empty stream = not modified) |
| do_action(DELETE) | Delete by prefix (supports {app}/* wildcard) |
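
As a rough illustration of the ticket and command formats listed above, the helpers below build and parse the raw bytes. The function names are hypothetical, and splitting on the last `:` is an assumption about how the server separates key from version.

```python
from typing import Tuple


def make_get_ticket(key: str, client_version: int) -> bytes:
    # do_get ticket format from the table: {key}:{version}
    return f"{key}:{client_version}".encode("utf-8")


def parse_get_ticket(ticket: bytes) -> Tuple[str, int]:
    # Assumption: split on the last ':' so a key containing ':' stays intact.
    key, _, version = ticket.decode("utf-8").rpartition(":")
    if not key:
        raise ValueError(f"malformed ticket: {ticket!r}")
    return key, int(version)


def make_patch_command(key: str) -> bytes:
    # Command bytes for the PATCH variant of do_put: PATCH:{key}
    return f"PATCH:{key}".encode("utf-8")
```

With pyarrow, such bytes would typically be wrapped as `pyarrow.flight.Ticket(...)` for `do_get` and passed to `FlightDescriptor.for_command(...)` for the PATCH upload.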

Testing

  • Rust unit tests for version initialization, increment, conditional GET
  • Python unit tests for cache hit/miss, version mismatch, bypass with version=0

Closes #426

Extract object cache into standalone flame-object-cache binary and implement
client-side caching with version tracking for efficient conditional gets.

Server-side changes:
- Convert object_cache to standalone binary (remove lib.rs, add main.rs)
- Initialize object version to 1 on PUT, increment on PATCH
- Implement conditional GET via do_get ticket format: key:version
- Return empty stream when client version matches (not modified)
- Add wildcard session support for bulk delete (app/*)
- Remove embedded cache from executor-manager

Client-side changes (Python SDK):
- Add client-side object cache with thread-safe access
- Implement version checking in get_object (always verify with server)
- Support version=0 for unconditional fetch (bypass cache)
- Invalidate cache on update_object/patch_object mutations
- Add delete_objects function with wildcard support

Deployment:
- Add Dockerfile.foc for standalone cache container
- Add flame-object-cache service to docker-compose.yaml
- Update CI configs to point cache endpoint to flame-object-cache
- Add --cache profile to flmadm install command
- Add systemd service template with MemoryMax=12G

Documentation:
- Add RFE426 design document and status
- Update RFE318 design with ObjectKey and wildcard session
- Update object_cache README with wire protocol details
- Update flmadm README with cache installation instructions

gemini-code-assist Bot left a comment

Code Review

This pull request extracts the object cache from the executor manager into a standalone service, flame-object-cache, and implements object versioning with client-side caching in the Python SDK. Key changes include updating deployment configurations, adding a new installation profile in flmadm, and modifying the storage layer to support application-scoped keys and wildcard session deletions. Feedback highlights critical issues where the client-side cache ignores custom deserializers on cache hits and the server-side versioning logic incorrectly resets version numbers to one during updates or when metadata is evicted from memory. Additionally, improvements are suggested for the validation of wildcard characters in object keys to prevent inconsistent state.

Comment on lines +402 to +406
    if cached_version > 0:
        with _cache_lock:
            cached = _object_cache.get(cache_key)
        if cached is not None:
            return cached.data

critical

On a cache hit (server returns not_modified), the code returns cached.data directly. However, cached.data contains the result of the first deserialization performed when the object was originally cached. If a user calls get_object with a different deserializer function on a cache hit, the new deserializer is ignored, and the old result is returned. The cache should either store the raw base and deltas to allow re-deserialization, or the cache key must include the deserializer identity.
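
One of the two options mentioned above, caching the raw payload and deserializing on every read, might look like this. It is a sketch under assumptions: `cache_store`, `cache_load`, and `_raw_cache` are hypothetical names, and the payload is modeled as a single bytes value rather than a base plus deltas.

```python
import json
import threading
from typing import Any, Callable, Dict, Tuple

_cache_lock = threading.Lock()
_raw_cache: Dict[str, Tuple[int, bytes]] = {}  # key -> (version, raw bytes)


def cache_store(key: str, version: int, raw: bytes) -> None:
    # Keep the undecoded payload so any deserializer can be applied later.
    with _cache_lock:
        _raw_cache[key] = (version, raw)


def cache_load(key: str, deserializer: Callable[[bytes], Any]) -> Any:
    # Apply the caller's deserializer on every hit instead of reusing
    # whatever the first caller happened to produce.
    with _cache_lock:
        _version, raw = _raw_cache[key]
    return deserializer(raw)


cache_store("app/s1/o1", 1, b'{"a": 1}')
as_dict = cache_load("app/s1/o1", json.loads)
as_text = cache_load("app/s1/o1", lambda b: b.decode("utf-8"))
```

The trade-off is repeated deserialization cost on each hit; keying the cache by deserializer identity avoids that at the price of storing several decoded copies.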

Comment thread object_cache/src/cache.rs Outdated
let size = object.data.len() as u64;

self.storage.write_object(&key, &object).await?;
let versioned_object = Object::new(1, object.data);

high

The put operation always initializes the object version to 1. This is correct for new objects, but put is also used for UPDATE operations (overwriting existing objects). In the case of an update, the version should be incremented from the current version rather than reset to 1. Resetting the version breaks the monotonicity requirement for client-side caching.

        let current_version = {
            let metadata = lock_ptr!(self.metadata)?;
            metadata.get(&key_str).map(|m| m.version).unwrap_or(0)
        };
        let new_version = current_version + 1;
        let versioned_object = Object::new(new_version, object.data);

        self.storage.write_object(&key, &versioned_object).await?;

        let meta = self.create_metadata(key_str.clone(), new_version, size, 0);

Comment thread object_cache/src/cache.rs Outdated
Comment on lines +486 to +490
        let current_version = {
            let metadata = lock_ptr!(self.metadata)?;
            metadata.get(&key_str).map(|m| m.version).unwrap_or(0)
        };
        let new_version = current_version + 1;

high

The patch operation calculates the new_version based on the in-memory metadata map. If an object has been evicted from memory but remains on disk, metadata.get(&key_str) will return None, causing the version to reset to 1 (0 + 1). This violates version monotonicity. The current version should be retrieved from the storage engine if it's not present in the in-memory metadata.
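
The suggested lookup order (in-memory metadata first, then storage, then 0 for a new object) can be expressed as a small Python model. `next_version` and `read_stored_version` are hypothetical names used only for illustration; the real fix lives in the Rust cache.

```python
from typing import Callable, Dict, Optional


def next_version(
    key: str,
    metadata: Dict[str, int],
    read_stored_version: Callable[[str], Optional[int]],
) -> int:
    # Prefer the in-memory metadata; fall back to storage if the entry was evicted.
    current = metadata.get(key)
    if current is None:
        current = read_stored_version(key)
    # Only a key missing from both places is treated as new (version 0).
    return (current or 0) + 1
```

With this ordering, an object evicted from memory at version 5 is patched to version 6 rather than being reset to 1.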

Comment thread object_cache/src/cache.rs
Comment on lines +72 to +76
            if *part == WILDCARD_SESSION && i != 1 {
                return Err(FlameError::InvalidConfig(
                    "Wildcard '*' only allowed for session_id".to_string(),
                ));
            }

medium

The wildcard character * is currently only forbidden for app_name and object_id via the loop index check, but it should be explicitly forbidden in all components except when used as a standalone session_id. Additionally, app/session/* is currently accepted as a valid 3-part key where the object ID is literally *, which contradicts the design that wildcard keys cannot reference specific objects.
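
A stricter validation along these lines could be sketched as follows. This is a Python model, not the Rust parser: `validate_key` is a hypothetical name, and the `app/session[/object_id]` key layout is assumed.

```python
WILDCARD = "*"


def validate_key(key: str) -> None:
    """Raise ValueError unless '*' appears only as a standalone session_id."""
    parts = key.split("/")
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"invalid object key: {key!r}")
    for i, part in enumerate(parts):
        if WILDCARD not in part:
            continue
        # Reject '*' in app_name or object_id, and partial wildcards such as "s*".
        if i != 1 or part != WILDCARD:
            raise ValueError("'*' is only allowed as a standalone session_id")
    # A wildcard session selects many sessions, so it cannot name one object.
    if len(parts) == 3 and parts[1] == WILDCARD:
        raise ValueError("wildcard keys cannot reference a specific object")
```

Under these rules `app/*` and `app/s1/o1` are accepted, while `*/s1`, `app/s1/*`, `app/s*`, and `app/*/o1` are rejected.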

k82cn added 5 commits April 29, 2026 13:06
- Fix version monotonicity: PUT now increments version for existing objects
  instead of resetting to 1. Check memory first, then storage for current version.
- Fix PATCH version lookup: fetch current version from storage when metadata
  is evicted from memory to maintain monotonicity.
- Fix wildcard validation: explicitly reject '*' in object_id position.
- Apply cargo fmt formatting fixes.
Remove hardcoded /etc/flame/flame-cluster.yaml default and use
FlameClusterContext::from_file(None) which searches standard locations
(~/.flame/flame-cluster.yaml) like other Flame services.
k82cn added 6 commits April 29, 2026 14:25
Arrow Flight requires at least a schema message even for empty responses.
Return schema-only FlightData instead of completely empty stream when
client version matches server version (not modified case).
- Stop services before restarting to avoid race conditions
- Start services in correct dependency order with delays
- Add ./work/cache mount to flame-console and flame-executor-manager

codecov Bot commented Apr 29, 2026

k82cn added 4 commits April 29, 2026 17:07
- Enable RUST_LOG=debug in compose.yaml for all services
- Add detailed logging to RecursiveService.compute_recursive() with exception tracing
- Add timing measurements to test_runner_recursive_same_session
- Add logging to RunnerService session opening and task submission
- Add logging to Runner._start() for app existence checking
- Add logging to cache.py for put/get/update/patch operations
- Use common::init_logger for object_cache to write to log file
…entation

- Revert RUST_LOG to info in compose.yaml
- Remove verbose debug logs from cache.py, keep only error logs
- Simplify RunnerService logs to essential session opening info
- Simplify RecursiveService logs to show depth and key state changes
- Simplify test logs to show depth, result, and timing only
- Change logging.basicConfig to INFO level in recursive test


Development

Successfully merging this pull request may close these issues.

Add version to object in cache

1 participant