feat: KEP-2655: Add data cache system#2755

Merged
google-oss-prow[bot] merged 1 commit into kubeflow:master from akshaychitneni:cache-oss
Sep 22, 2025

Conversation

@akshaychitneni
Contributor

@akshaychitneni akshaychitneni commented Jul 28, 2025

What this PR does / why we need it:
This PR adds a distributed data cache, as described in the design document: https://docs.google.com/document/d/1xj3K6bOT4f0EPiC4zwr2OsbRkROjIA8u_TBRZRVxrHI
Follow-up PRs will add the SDK and trainer integrations.

Proposal - kubeflow/community#864
KEP - #2655

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2757

Checklist:

  • Docs included if any changes are user facing

@akshaychitneni akshaychitneni changed the title KEP-2655: Add data cache system feat - KEP-2655: Add data cache system Jul 28, 2025
@coveralls

coveralls commented Jul 28, 2025

Pull Request Test Coverage Report for Build 17837270829

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.137%

Totals Coverage Status
Change from base Build 17772927799: 0.0%
Covered Lines: 1084
Relevant Lines: 1966

💛 - Coveralls

Comment thread pkg/data_cache/src/client/src/main.rs Outdated
Comment thread pkg/data_cache/src/head/head.rs
Comment thread pkg/data_cache/src/head/head.rs Outdated
Comment thread pkg/data_cache/src/head/writer.rs Outdated
Comment thread pkg/data_cache/src/worker/indexable_mem_table.rs Outdated
Comment thread pkg/data_cache/src/worker/indexable_mem_table.rs Outdated
Comment thread pkg/data_cache/src/worker/indexable_mem_table.rs Outdated
Comment thread pkg/data_cache/src/worker/worker_service.rs Outdated
Comment thread pkg/data_cache/src/head/head.rs Outdated
Comment thread pkg/data_cache/README.md
Comment thread pkg/data_cache/README.md Outdated
@akshaychitneni akshaychitneni force-pushed the cache-oss branch 2 times, most recently from 3b7359b to b640ffa Compare August 8, 2025 03:27
Comment thread pkg/data_cache/Cargo.toml
Comment on lines +11 to +12
arrow = "55.0.0"
arrow-schema = "55.0.0"
Member

Shall we migrate to Arrow 55.2, as @comphead suggested?

Contributor Author

I am using DataFusion 47.0.0, which builds on Arrow 55.0.0: https://github.com/apache/datafusion/blob/e4433049b04ca2c1e2031eb05d1a0990210f11d6/Cargo.toml#L90. Let me address this in a follow-up, since it might require additional validation.
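
For context on the version question: a bare version string such as `"55.0.0"` in Cargo.toml is a caret requirement, so the range it admits is wider than the single version written. The following is a minimal sketch of that rule for full `major.minor.patch` versions only. The function names are hypothetical, pre-release and partial versions are ignored, and this is not Cargo's actual resolver; which version gets built still depends on Cargo.lock and on what the other dependencies require.

```rust
/// Parse a plain `major.minor.patch` version string (no pre-release tags).
fn parse(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.').map(|p| p.parse::<u64>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Hypothetical helper: does `candidate` satisfy the caret requirement `req`?
/// Returns None if either string is not a full `major.minor.patch` version.
fn caret_matches(req: &str, candidate: &str) -> Option<bool> {
    let r = parse(req)?;
    let c = parse(candidate)?;
    // Exclusive upper bound: bump the left-most non-zero component.
    let upper = match r {
        (0, 0, patch) => (0, 0, patch + 1),
        (0, minor, _) => (0, minor + 1, 0),
        (major, _, _) => (major + 1, 0, 0),
    };
    // Tuples compare lexicographically, which matches version ordering here.
    Some(c >= r && c < upper)
}

fn main() {
    println!("{:?}", caret_matches("55.0.0", "55.2.0"));
    println!("{:?}", caret_matches("55.0.0", "56.0.0"));
}
```

Under this rule a 55.2.x release falls inside the range declared by `"55.0.0"`, while 56.0.0 does not; the validation concern raised above is separate from what the manifest permits.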

Member

@akshaychitneni Can you create a tracking issue for that, please?
We can address it in follow-up PRs.

Comment thread pkg/data_cache/OWNERS Outdated
Comment thread pkg/data_cache/Cargo.toml Outdated
Comment thread pkg/data_cache/Cargo.toml Outdated
Comment thread pkg/data_cache/src/lib.rs Outdated
Comment thread pkg/data_cache/src/head/head.rs Outdated
Comment thread pkg/data_cache/src/head/writer.rs
Comment thread pkg/data_cache/src/head/head_service.rs Outdated
Comment on lines +259 to +308
#[derive(Serialize, Deserialize, Debug)]
struct IndexPair {
    start: u64,
    end: u64,
}
Member

Can we use this struct from main.rs?
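
One way to read this suggestion is to define the struct once and import it wherever it is needed. The sketch below illustrates that with plain modules in a single file. The module names, the `assign` and `describe` helpers, and the exclusive `end` bound are all assumptions for illustration; the serde derives from the PR are omitted so the example needs only the standard library.

```rust
// Hypothetical shared module; in a real crate this could live in lib.rs.
mod shared {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct IndexPair {
        pub start: u64,
        pub end: u64, // assumed exclusive for this sketch
    }
}

// Hypothetical producer side: builds an index range.
mod head_service {
    use super::shared::IndexPair;

    pub fn assign(start: u64, count: u64) -> IndexPair {
        IndexPair { start, end: start + count }
    }
}

// Hypothetical consumer side: reuses the same type instead of redefining it.
mod client {
    use super::shared::IndexPair;

    pub fn describe(pair: &IndexPair) -> String {
        format!("rows {}..{}", pair.start, pair.end)
    }
}

fn main() {
    let pair = head_service::assign(100, 50);
    println!("{}", client::describe(&pair)); // prints "rows 100..150"
}
```

Keeping a single definition means both sides agree on the field names and types, which matters when the struct is serialized across a process boundary.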

Comment thread pkg/data_cache/src/head/head.rs Outdated

pub async fn init(&mut self) -> Result<()> {
    let _ = self.fetch_data_files().await;
    let df = self.ctx.sql("select * from memtable").await?.collect().await?;
Member

Does DataFusion create the memtable for us?

Contributor Author

We register a table named memtable here: https://github.com/kubeflow/trainer/pull/2755/files#diff-d6c85a17f636003dada6db08feb3052bf87b12f6672723af5b503ec390d9c980R94. I will probably use the variable instead of hard-coding the name.
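
The change mentioned here, using one variable for the table name instead of repeating the literal, could look roughly like the sketch below. `MEMTABLE_NAME` and `select_all` are hypothetical names, not taken from the PR; the same constant would also be passed to whichever call registers the table, so the registration and the query cannot drift apart.

```rust
// Hypothetical constant: the single source of truth for the table name.
const MEMTABLE_NAME: &str = "memtable";

// Hypothetical helper: builds the query text from the shared name.
fn select_all(table: &str) -> String {
    format!("select * from {table}")
}

fn main() {
    let sql = select_all(MEMTABLE_NAME);
    println!("{sql}"); // prints "select * from memtable"
}
```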

Comment thread pkg/data_cache/Cargo.toml Outdated
Member

@andreyvelich andreyvelich left a comment

@akshaychitneni I left a few comments.
Overall looks good.

Can you fix the CI, please?

/cc @kubeflow/kubeflow-trainer-team @rudeigerc @raravena80 @comphead
in case you want to provide any additional comments.

Comment thread cmd/data_cache/Dockerfile
Comment thread .gitignore Outdated
Comment thread pkg/data_cache/README.md Outdated
Comment thread pkg/data_cache/src/head/head_service.rs Outdated
Comment thread pkg/data_cache/src/head/head.rs Outdated
Comment thread pkg/data_cache/src/worker/worker_service.rs Outdated
Comment thread pkg/data_cache/src/worker/indexable_mem_table.rs
Comment thread pkg/data_cache/test/src/main.rs Outdated
@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-trainer-team, raravena80.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

@akshaychitneni I left a few comments.
Overall looks good.

Can you fix the CI, please?

/cc @kubeflow/kubeflow-trainer-team @rudeigerc @raravena80 @comphead
in case you want to provide any additional comments.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich
Member

@akshaychitneni Can you also add a new scope to the PR title verification, e.g. cache, so that we can use feat(cache): ... in titles? See https://github.com/kubeflow/trainer/blob/master/.github/workflows/check-pr-title.yaml#L28

@andreyvelich andreyvelich added this to the v2.1 milestone Sep 17, 2025
@akshaychitneni akshaychitneni changed the title feat - KEP-2655: Add data cache system feat(cache) - KEP-2655: Add data cache system Sep 17, 2025
Comment on lines +1561 to +1573
[[package]]
name = "tracing-subscriber"
version = "0.3.19"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e8189decb5ac0fa7bc8b96b7cb9b2701d60d48805aca84a238004d665fcc4008"
dependencies = [
 "nu-ansi-term",
 "sharded-slab",
 "smallvec",
 "thread_local",
 "tracing-core",
 "tracing-log",
]

Check notice

Code scanning / Trivy

tracing-subscriber: Tracing log pollution Low test

Package: tracing-subscriber
Installed Version: 0.3.19
Vulnerability CVE-2025-58160
Severity: LOW
Fixed Version: 0.3.20
Link: CVE-2025-58160
@@ -0,0 +1,51 @@
use std::env;
Member

@akshaychitneni Could you also move the main files for head and worker to cmd, similar to the controller manager (https://github.com/kubeflow/trainer/tree/master/cmd/trainer-controller-manager)?
I would suggest this structure:

cmd/cache/Dockerfile
cmd/cache/head/main.rs
cmd/cache/worker/main.rs

WDYT?

Member

However, I am not sure whether Cargo supports such a folder structure:

[[bin]]
name = "head"
path = "cmd/head/main.rs"
[[bin]]
name = "worker"
path = "cmd/worker/main.rs"

Contributor Author

This might not be helpful, as it involves moving the files outside of the Cargo package.

Member

@andreyvelich andreyvelich left a comment

Thanks for this awesome effort 🚀
/lgtm
/assign @astefanutti @tenzen-y @Electronic-Waste @kubeflow/wg-training-leads @kubeflow/kubeflow-trainer-team
Please let us know if you have any additional comments before we move this forward.

@google-oss-prow google-oss-prow Bot added the lgtm label Sep 18, 2025
@google-oss-prow google-oss-prow Bot removed the lgtm label Sep 18, 2025
@andreyvelich andreyvelich changed the title feat(cache): KEP-2655: Add data cache system feat: KEP-2655: Add data cache system Sep 18, 2025
@andreyvelich
Member

/lgtm

@google-oss-prow google-oss-prow Bot added the lgtm label Sep 18, 2025
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
@andreyvelich
Member

/lgtm

Member

@andreyvelich andreyvelich left a comment

We can move forward with this PR and address the remaining comments in follow-up changes.
Thanks again for this huge effort @akshaychitneni!
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot merged commit bbaca24 into kubeflow:master Sep 22, 2025
34 of 35 checks passed
alexxfan pushed a commit to red-hat-data-services/trainer that referenced this pull request Nov 24, 2025
Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
5 participants