fix(backend): randomizing output uri path to avoid overwriting. Fixes #10186 #11243

b4sus · 2024-09-24T10:02:17Z

In the driver, a random string is now added when URI paths for output artifacts are generated. This should ensure that when a component with a given name is executed in parallel (either with ParallelFor or simply by calling it multiple times in a @pipeline), its outputs are always stored at different paths.

Signed-off-by: b4sus <[email protected]>
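
For readers following along, the idea described above can be sketched in a few lines of Python. This is only an illustration, not the driver code from this PR (the driver is part of the Go backend). The helper name `salted_output_uri`, the path layout, and the example bucket are all assumptions made for the sketch.

```python
import uuid


def salted_output_uri(pipeline_root: str, task_name: str, artifact_name: str) -> str:
    """Build an output artifact URI that contains a random path segment.

    Hypothetical helper: the real path layout used by the driver may differ.
    """
    salt = uuid.uuid4().hex  # random 32-character hex string
    return "/".join([pipeline_root.rstrip("/"), task_name, salt, artifact_name])


# Two executions of the same component, with the same output name,
# end up at different paths because each call draws a fresh salt.
root = "minio://example-bucket/v2/artifacts/run-1"  # assumed example root
uri_a = salted_output_uri(root, "component1", "output_df")
uri_b = salted_output_uri(root, "component1", "output_df")
print(uri_a)
print(uri_b)
print(uri_a != uri_b)
```

The trade-off of a random segment is that paths are no longer predictable from the task name alone, so consumers have to look the URI up from the recorded artifact metadata rather than reconstruct it.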

google-oss-prow · 2024-09-24T10:02:29Z

Hi @b4sus. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hbelmiro

/lgtm
/ok-to-test

HumairAK · 2024-09-27T21:32:35Z

Hey @b4sus , thanks for the contribution!

Can you provide a sample pipeline that illustrates the issue this pr is aiming to resolve?

At least in the case of a component being re-used, I believe the task name will have a -# suffix, so it should already be distinguished from earlier calls. For ParallelFor, I'd be interested in its impact on #10798.

cc @gmfrasca

gmfrasca · 2024-09-27T22:21:43Z

@HumairAK - This appears to only impact output artifacts, and only changes the driver behavior when in CONTAINER driver mode, so I don't believe this should have any effect on #10798 in terms of sub-DAG naming schemes, etc.

With that said, I did see that ParallelFor outputs are storing artifacts in the same URI, which is a problem that this PR addresses by adding UUID salts.

gmfrasca

Tested this out using a ParallelFor task and confirmed each iteration's output artifacts are given unique URIs which are referenced properly in KFP UI.

/lgtm

b4sus · 2024-09-30T08:52:30Z

Hey @HumairAK ,

We noticed the problem when, from one pipeline, we started other pipelines (pipeline as component) using ParallelFor. This is roughly the code:

@dsl.pipeline
def inner_pipeline(date_to_process: str):
    comp1_task = component1(date_to_process = date_to_process)
    comp2_task = component2(comp1_task.outputs["output_df"])

@dsl.pipeline
def main_pipeline(from_date: str, to_date: str):
    prepare_dates_task = prepare_dates_component(from_date = from_date, to_date = to_date)

    with dsl.ParallelFor(items = prepare_dates_task.output, parallelism=4) as date_to_process:
        inner_ppln_task = inner_pipeline(date_to_process = date_to_process)

In this case, many inner pipelines were started (more than 4, as parallelism is not yet supported), and the problem was that the output of component1 was/is written to the same MinIO location, so the runs overwrote each other. Subsequently, several component2 tasks got the same input regardless of the argument (date_to_process), producing the same final output (not visible in the code here, as it is stored directly in the component).
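
The overwrite described above can be reproduced in miniature without a cluster. The sketch below uses a plain dict as a stand-in for the object store and assumes the worst-case ordering in which every iteration's component1 finishes writing before any component2 reads; real runs race, so the exact symptoms vary. The function names and URI layout are hypothetical.

```python
import uuid


def fixed_uri(task: str, artifact: str) -> str:
    # Path derived only from the task and artifact names (collides across iterations).
    return f"minio://example-bucket/run/{task}/{artifact}"


def salted_uri(task: str, artifact: str) -> str:
    # Same path with a random segment added (unique per call).
    return f"minio://example-bucket/run/{task}/{uuid.uuid4().hex}/{artifact}"


def run_iterations(dates, make_uri):
    store = {}  # stand-in for the object store: URI -> content
    uris = []
    for date in dates:
        uri = make_uri("component1", "output_df")
        store[uri] = date  # component1 writes its output
        uris.append(uri)
    # component2 of each iteration then reads from "its" URI
    return [store[uri] for uri in uris]


dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
before = run_iterations(dates, fixed_uri)
after = run_iterations(dates, salted_uri)
print(before)  # ['2024-01-03', '2024-01-03', '2024-01-03']
print(after)   # ['2024-01-01', '2024-01-02', '2024-01-03']
```

With the fixed path, every reader sees whichever write landed last; with the salted path, each iteration reads back its own date.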

HumairAK · 2024-10-03T20:37:50Z

Perfect, thanks guys

tested and works as well with the following pipeline:

pipeline.py

from typing import List

from kfp import dsl, compiler
from kfp.dsl import Dataset
from kfp.dsl import Output, InputPath

@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def component1(date_to_process: str, output_df: Output[Dataset]):
    with open(output_df.path, 'w') as f:
        f.write(date_to_process)

@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def component2(dataset_in: InputPath('Dataset')):
    with open(dataset_in, 'r') as input_file:
        dataset_one_contents = input_file.read()
    print(dataset_one_contents)

@dsl.component(base_image="quay.io/opendatahub/ds-pipelines-ci-executor-image:v1.0")
def prepare_dates_component() -> List[str]:
    return ["1", "2", "3", "4", "5", "6"]

@dsl.pipeline
def inner_pipeline(date_to_process: str):
    comp1_task = component1(date_to_process = date_to_process).set_caching_options(enable_caching=False)
    comp2_task = component2(dataset_in = comp1_task.outputs["output_df"]).set_caching_options(enable_caching=False)

@dsl.pipeline
def main_pipeline():
    prepare_dates_task = prepare_dates_component().set_caching_options(enable_caching=False)

    with dsl.ParallelFor(items = prepare_dates_task.output, parallelism=4) as date_to_process:
        inner_ppln_task = inner_pipeline(date_to_process = date_to_process)


if __name__ == '__main__':
    compiler.Compiler().compile(main_pipeline, __file__ + '.yaml')

before:

after:

/lgtm
/approve

google-oss-prow · 2024-10-03T20:38:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~backend/OWNERS~~ [HumairAK]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wassimbensalem · 2025-02-21T14:54:52Z

Should we use the exact same base image? I ran it without that image, and it failed. Additionally, I'm unable to read the artifacts in a loop—any idea why?

I used the exact same code but still couldn't read the artifact in a loop. Component 2 doesn't even start, and this issue keeps occurring.

Moreover, when not using that base image, it fails to generate a random UUID for my paths and always overwrites the file.

Any help, please? It's blocking our team, thanks!

HumairAK · 2025-02-23T01:14:50Z

@wassimbensalem I think you might be encountering a couple of different issues. Can you reach out in the CNCF KFP platform Slack with the errors you are encountering? For more context on this base image, you can find the Dockerfile here.

This sounds related to this issue, which was recently resolved on the master branch. Let us know in Slack if that is indeed the case; if not, I suggest creating a new issue with a reproducible pipeline and information about the platform you are using, as well as the KFP / KFP SDK / k8s versions.

Added uuid to outputs uri path to avoid overwriting artifacts

aa16076

Signed-off-by: b4sus <[email protected]>

google-oss-prow bot requested review from HumairAK and rimolive September 24, 2024 10:02

google-oss-prow bot added size/XS needs-ok-to-test labels Sep 24, 2024

b4sus mentioned this pull request Sep 24, 2024

[backend] Artifacts of Components in ParallelFor / Sub-DAGs Overwritten by Concurrent Iterations #10186

Closed

hbelmiro reviewed Sep 24, 2024

google-oss-prow bot added ok-to-test and removed needs-ok-to-test labels Sep 24, 2024

google-oss-prow bot assigned hbelmiro Sep 24, 2024

google-oss-prow bot added the lgtm label Sep 24, 2024

hbelmiro mentioned this pull request Sep 27, 2024

chore: adding @hbelmiro to backend reviewers #11256

Merged

2 tasks

gmfrasca reviewed Sep 27, 2024

google-oss-prow bot assigned gmfrasca Sep 27, 2024

google-oss-prow bot assigned HumairAK Oct 3, 2024

google-oss-prow bot added the approved label Oct 3, 2024

google-oss-prow bot merged commit 219725d into kubeflow:master Oct 3, 2024
13 checks passed

b4sus deleted the fix_backend_overwriting_artifacts branch October 4, 2024 07:06
