@avishniakov avishniakov commented Feb 19, 2024

Describe changes

This PR solves a few parallelization issues we had:

  • save_artifact logic is improved: it now tolerates parallel creation of an Artifact and retries creation of a new Artifact Version when no explicit version name is given
  • Model Version creation is now also tolerant of parallel execution and equipped with retry logic, so that parallel runs do not get dumped into the same Model Version
  • Both cases are covered by heavily parallel test cases to prove they hold up in the long run
  • New MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION constant introduced and set to 10 retries for now, with a cooldown of 0.2 seconds that grows with the retry count (i.e. 0.2 * retry_num). The value 10 is empirical and might need further tuning.
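
For readers who want to see the retry shape described in the list above, here is a minimal sketch, not the code from this PR. The constant name and the 0.2 * retry_num cooldown come from the description; `create_with_retry`, `COOLDOWN_SECONDS`, the `sleep` parameter, and the local `EntityExistsError` class are hypothetical stand-ins chosen for illustration.

```python
import time

# Value taken from the PR description; described there as empirical.
MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION = 10
# Hypothetical name for the base cooldown of 0.2 seconds.
COOLDOWN_SECONDS = 0.2


class EntityExistsError(Exception):
    """Local stand-in for the error raised when a version already exists."""


def create_with_retry(
    create_fn,
    max_retries=MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION,
    cooldown=COOLDOWN_SECONDS,
    sleep=time.sleep,
):
    """Call create_fn until it succeeds or the retries are exhausted.

    The wait before the next attempt grows linearly with the retry
    number: cooldown * retry_num.
    """
    for retry_num in range(1, max_retries + 1):
        try:
            return create_fn()
        except EntityExistsError:
            if retry_num == max_retries:
                # Out of retries: let the caller see the conflict.
                raise
            sleep(cooldown * retry_num)
```

Passing `sleep` as a parameter is only a convenience of this sketch, so that tests can record the cooldowns instead of actually waiting.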

Tiny side improvement:

  • Reduced the warning logging for Model config mismatch: based on user feedback, it doesn't make sense in production usage and is only annoying ( OSSK-364 )

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Summary by CodeRabbit

  • New Features

    • Enhanced artifact and model version creation with improved error handling and a retry mechanism.
    • Implemented efficient pipeline registration and reuse functionality.
    • Added a new test case for verifying parallel artifact registration in pipelines.
    • Introduced a script for running pipelines with parallel steps and registering artifacts.
  • Bug Fixes

    • Refined error handling for existing artifacts and model versions to prevent duplicate entries.
  • Refactor

    • Streamlined artifact creation logic for better performance and reliability.
    • Removed outdated model configuration comparison logic.
  • Tests

    • Added integration tests for parallel model version creation and pipeline execution.


coderabbitai bot commented Feb 19, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

The updates focus on enhancing the artifact and model versioning system in ZenML, introducing a retry mechanism for creation processes, and improving error handling for entity existence. Changes include the addition of a new constant for maximum retries, refactoring of artifact creation logic, and implementation of efficient registration and reuse strategies for pipelines. Additionally, integration tests have been expanded to cover parallel creation scenarios for models and pipelines, ensuring robustness in heavily parallelized environments.

Changes

| File(s) | Summary |
| --- | --- |
| src/zenml/.../utils.py, src/zenml/.../model.py, src/zenml/.../pipeline.py, src/zenml/.../sql_zen_store.py | Enhanced artifact and model versioning with retry mechanisms and improved error handling. Introduced a new constant MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION. |
| src/zenml/constants.py | Added MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION constant. |
| tests/integration/functional/model/test_model_version.py, tests/integration/functional/pipelines/test_pipeline_parallel.py, tests/integration/functional/pipelines/util_parallel_pipeline_script.py | Expanded integration tests for parallel model and pipeline version creation, including artifact registration verification. |

@avishniakov

@coderabbitai review


@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 10

Configuration used: .coderabbit.yaml

Commits: Files that changed from the base of the PR and between dfe71ed and a88c99f.
Files selected for processing (8)
  • src/zenml/artifacts/utils.py (4 hunks)
  • src/zenml/constants.py (1 hunks)
  • src/zenml/model/model.py (5 hunks)
  • src/zenml/new/pipelines/pipeline.py (3 hunks)
  • src/zenml/zen_stores/sql_zen_store.py (4 hunks)
  • tests/integration/functional/model/test_model_version.py (5 hunks)
  • tests/integration/functional/pipelines/test_pipeline_parallel.py (1 hunks)
  • tests/integration/functional/pipelines/util_parallel_pipeline_script.py (1 hunks)
Additional comments: 15
tests/integration/functional/pipelines/util_parallel_pipeline_script.py (3)
  • 8-10: The register_artifact step function is correctly defined with caching disabled, which is suitable for testing parallel executions where caching could interfere with the test's integrity. The return value simulates an artifact registration process.
  • 13-16: The parallel_ pipeline function iterates over a range of steps_count and calls register_artifact for each iteration. This setup is appropriate for testing parallel executions of artifact registration. However, it's important to ensure that the steps_count and run_id parameters are correctly passed and used, especially in a parallel execution context.
  • 19-21: The script execution entry point correctly parses command line arguments to extract run_prefix, i, and steps_count. It's crucial that these arguments are validated and correctly converted to their expected types (e.g., steps_count and i should be integers) to avoid runtime errors.
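
The validation concern raised for the script entry point could be addressed along these lines. This is a hedged sketch rather than the script from the PR: the argument names run_prefix, i, and steps_count come from the review comment, while their positional order, the `parse_script_args` helper, and the positivity check are assumptions made for illustration.

```python
import argparse


def parse_script_args(argv):
    """Parse and validate the three command line arguments.

    Assumed order: run_prefix, i, steps_count. argparse converts the
    integers and exits with a usage error if conversion fails.
    """
    parser = argparse.ArgumentParser(
        description="Run a pipeline with parallel steps (illustrative)."
    )
    parser.add_argument("run_prefix", type=str)
    parser.add_argument("i", type=int)
    parser.add_argument("steps_count", type=int)
    args = parser.parse_args(argv)
    if args.steps_count < 1:
        # Hypothetical extra check: a pipeline needs at least one step.
        parser.error("steps_count must be a positive integer")
    return args.run_prefix, args.i, args.steps_count
```

In a script this would be called as `parse_script_args(sys.argv[1:])`, so malformed input fails with a usage message before any pipeline code runs.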
tests/integration/functional/pipelines/test_pipeline_parallel.py (1)
  • 22-59: The test method test_parallel_runs_can_register_same_artifact is well-structured and follows a clear logic to test parallel artifact registration. It uses subprocesses to execute the pipeline script in parallel, which is a suitable approach for this test scenario. The assertions at the end of the test method are comprehensive, checking for the completion status of pipeline runs, the registration of all artifacts, their values, and unique versions. This thorough approach ensures that the parallel execution logic works as expected.
src/zenml/constants.py (1)
  • 318-320: The introduction of MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION with a value of 10 is a sensible addition to handle parallelized tests for versioned entity creation. The comment "empirical value to pass heavy parallelized tests" provides context for the choice of value, though it might be beneficial to include more detail on how this value was determined or any specific scenarios it addresses.
tests/integration/functional/model/test_model_version.py (2)
  • 14-14: The import of multiprocessing is necessary for the new test that validates parallel model version creation. This aligns with the PR's objective to improve parallel handling.
  • 119-120: The function parallel_model_version_creation is introduced to simulate the parallel creation of model versions. It directly calls a method on the Model class to either get an existing model version or create a new one. This function is crucial for the new test that assesses the system's ability to handle parallel model version creation without conflicts or errors.
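
The property these parallel tests check, that concurrent creators never end up sharing a version, can be illustrated with a toy example. This sketch assumes nothing about ZenML's API: `ToyVersionStore`, `create_next_version`, and `run_parallel` are hypothetical, an in-memory set stands in for the database uniqueness constraint, and threads are used here instead of the multiprocessing and subprocess approaches the actual tests use.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class EntityExistsError(Exception):
    """Local stand-in for a uniqueness conflict."""


class ToyVersionStore:
    """In-memory store that rejects duplicate version numbers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = set()

    def latest(self):
        with self._lock:
            return max(self._versions, default=0)

    def create(self, version):
        with self._lock:
            if version in self._versions:
                raise EntityExistsError(version)
            self._versions.add(version)
        return version


def create_next_version(store, max_retries=10, cooldown=0.001):
    """Read the latest version, try to claim the next one, retry on conflict."""
    for retry_num in range(1, max_retries + 1):
        # Reading and creating are separate steps, so another worker
        # may claim the same candidate in between.
        candidate = store.latest() + 1
        try:
            return store.create(candidate)
        except EntityExistsError:
            if retry_num == max_retries:
                raise
            time.sleep(cooldown * retry_num)


def run_parallel(workers):
    """Create one version per worker concurrently and return all of them."""
    store = ToyVersionStore()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(create_next_version, store) for _ in range(workers)
        ]
        return [future.result() for future in futures]
```

In this toy setup every conflict a worker hits corresponds to a version claimed by a different worker, so with 8 workers no worker needs more than 8 attempts and 10 retries are sufficient. A real database with separate processes gives no such bound, which is consistent with the PR calling the value empirical.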
src/zenml/model/model.py (2)
  • 16-16: The import of the time module is correctly added to support the sleep functionality used in the retry mechanism. This is a necessary addition for implementing delays between retries.
  • 29-29: The import of MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION is correctly added and is essential for defining the maximum number of retries in the retry mechanism for creating model versions. This constant plays a crucial role in controlling the retry behavior.
src/zenml/artifacts/utils.py (5)
  • 20-20: The import of the time module is correctly added to support the sleep functionality used in the retry mechanism for artifact version creation. This is a necessary addition for implementing delays between retries.
  • 27-30: The addition of the MAX_RETRIES_FOR_VERSIONED_ENTITY_CREATION constant is correctly implemented. It's well-placed within the imports section, ensuring that it's available throughout the file. This constant is crucial for controlling the retry behavior in artifact and model version creation processes.
  • 38-41: The inclusion of the EntityExistsError in the imports section is appropriate, given its usage in the updated save_artifact function to handle cases where an artifact version already exists. This change aligns with the PR's objective to improve error handling in parallel execution scenarios.
  • 118-118: The documentation for the save_artifact function has been updated to include EntityExistsError under the Raises section. This accurately reflects the changes made to the function's implementation, ensuring that users are aware of the potential exceptions that can be raised.
  • 248-250: Raising EntityExistsError when the artifact version creation fails after all retries is appropriate and aligns with the PR's objectives to improve error handling. This ensures that the caller is informed of the failure to create a unique artifact version, which is crucial in parallel execution environments.
src/zenml/new/pipelines/pipeline.py (1)
  • 57-57: The import of EntityExistsError is correctly added to handle specific exceptions related to entity existence conflicts during pipeline registration. This aligns with the PR objectives of improving error handling for parallel operations.

@github-actions github-actions bot added the internal (To filter out internal PRs and issues) and bug (Something isn't working) labels Feb 19, 2024
@avishniakov avishniakov requested a review from strickvl February 20, 2024 09:05

@strickvl strickvl left a comment

LGTM. Let's let the CAB member know whenever it's merged.

@strickvl strickvl changed the title from "Parallel pipelines can create entites in DB" to "Parallel pipelines can create entities in DB" Feb 20, 2024

@bcdurak bcdurak left a comment

Aside from a small nitpick (feel free to ignore), everything looks good!

@avishniakov avishniakov merged commit 1ffe038 into develop Feb 21, 2024
@avishniakov avishniakov deleted the bugfix/OSSK-438-parallel-pipelines-can-fail-to-create-artifacts branch February 21, 2024 13:11
adtygan pushed a commit to adtygan/zenml that referenced this pull request Mar 21, 2024
* fix parallel artifacts registration

* remove excessive warnings

* parallel safe model versions

* increase cool down a bit

* coderabbitai

* coderabbitai

* update test signature

* PR suggestions from Alex

* kudos to windows

* give some more retries for docker CIs

* try to fix test case

* fix parallel tests
Labels
bug (Something isn't working), internal (To filter out internal PRs and issues), run-slow-ci

3 participants