fix breaking changes for ONNX Runtime Training #122000
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122000
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 22aef51 with merge base dfc5e93:
FLAKY - The following job failed but was likely due to flakiness present on trunk: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable)
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@cyyever @Skylion007 any thoughts on this?
@ajindal1 LGTM, hope that tests pass.
Why do we need a duplicate implementation rather than doing a const_cast in the header? (But also mark them deprecated so that there will be no use of such functions in Torch.)
aten/src/ATen/DLConvertor.cpp
Outdated
auto deleter = [src](void* self) {
  if (src->deleter) {
    // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
    src->deleter(const_cast<DLManagedTensor*>(src));
  }
};
return fromDLPack(src, std::move(deleter));
Do not copy the code, just const_cast in a top-level wrapper.
Suggested change:
- auto deleter = [src](void* self) {
-   if (src->deleter) {
-     // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
-     src->deleter(const_cast<DLManagedTensor*>(src));
-   }
- };
- return fromDLPack(src, std::move(deleter));
+ // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
+ return fromDLPack(const_cast<DLManagedTensor*>(src));
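For context on why a const_cast shows up here at all: the DLPack deleter callback takes a non-const pointer, so a `const DLManagedTensor*` cannot be released without casting. Below is an abridged sketch of the ownership struct, shown only for reference; see dlpack.h for the authoritative definition.

```cpp
// Abridged sketch of the DLPack ownership struct (the real definition lives in
// dlpack.h). The deleter signature is what forces the const_cast in the
// forwarding overload suggested above.
typedef struct DLManagedTensor {
  DLTensor dl_tensor;                        // tensor payload: data, shape, dtype, device
  void* manager_ctx;                         // opaque context owned by the producer
  void (*deleter)(struct DLManagedTensor*);  // note: takes a *non-const* pointer
} DLManagedTensor;
```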
aten/src/ATen/DLConvertor.h
Outdated
@@ -13,8 +13,11 @@ namespace at {
 TORCH_API ScalarType toScalarType(const DLDataType& dtype);
 TORCH_API DLManagedTensor* toDLPack(const Tensor& src);
 TORCH_API Tensor fromDLPack(DLManagedTensor* src);
+TORCH_API Tensor fromDLPack(const DLManagedTensor* src);
As those functions are semantically incorrect, let's mark them as deprecated and implement them in the header as const_casts forwarding to the non-const variants.
Suggested change:
- TORCH_API Tensor fromDLPack(const DLManagedTensor* src);
+ C10_DEPRECATED_MESSAGE("Please migrate to a non-const variant")
+ inline Tensor fromDLPack(const DLManagedTensor* src) { return fromDLPack(const_cast<DLManagedTensor*>(src)); }
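Putting this deprecation together with the clang-tidy suppression requested later in the thread, the header would end up looking roughly as follows. This is a sketch that assumes the forwarding overload sits next to the existing declaration; it is not the verbatim DLConvertor.h.

```cpp
// Sketch of the relevant part of aten/src/ATen/DLConvertor.h after the change
// (paraphrased; the real header may differ in details).
#include <ATen/Tensor.h>
#include <ATen/dlpack.h>
#include <c10/util/Deprecated.h>  // provides C10_DEPRECATED_MESSAGE

namespace at {

TORCH_API Tensor fromDLPack(DLManagedTensor* src);

// Deprecated const overload kept for backward compatibility (e.g. ORT training);
// it simply forwards to the non-const variant.
C10_DEPRECATED_MESSAGE("Please migrate to a non-const variant")
inline Tensor fromDLPack(const DLManagedTensor* src) {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  return fromDLPack(const_cast<DLManagedTensor*>(src));
}

} // namespace at
```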
Thanks, I have modified just the header file as you suggested, which is a minimal change and makes sense to me. I have started a run on my end to verify it works. Will confirm by EOD.
@ajindal1 The parameter is const_casted in the function body, so it shouldn't be taken as const. Is it possible to patch aten_op_executor.cc to use the non-const version?
@cyyever I don't completely understand your comment, can you please add more information? I am also unable to locate the aten_op_executor.cc file you mentioned.
I guess this means to call the non-const variant.
@cyyever we have also added the const_cast change in the onnxruntime repository. However, for backward compatibility we will need this change here as well.
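For illustration, a caller-side workaround of the kind described here might look like the following. This is a hypothetical sketch: the helper name `TensorFromDLPack` and the surrounding structure are assumptions, not taken from the actual aten_op_executor.cc.

```cpp
#include <ATen/DLConvertor.h>

// Hypothetical helper: cast away const before handing the capsule to ATen so
// the non-const at::fromDLPack overload is selected and no deprecated path is hit.
at::Tensor TensorFromDLPack(const DLManagedTensor* dlpack) {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  return at::fromDLPack(const_cast<DLManagedTensor*>(dlpack));
}
```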
@ajindal1 Thank you!
@malfet does everything look good to you now?
@ajindal1 if CI is green, yes, but I guess you'll need to add the clang-tidy suppression comment on top of the change; otherwise LGTM.
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the PyTorch AutoLabel Bot wiki page. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following check: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed: .github/workflows/trunk.yml / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbbot merge -if "I've asked to ignore that failure, don't I?"
@pytorchbot merge -f "I've asked to ignore that failure, don't I?"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` as a last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick
❌ 🤖 pytorchbot command failed: the cherry-pick command is missing its required arguments. Try `@pytorchbot --help` for more info.
@pytorchbot cherry-pick --onto release/2.3 --fixes "foobar" -c regression
Fixes breaking changes for ONNX Runtime Training. PR #121102 introduced an incompatibility with ORT training because of a change in parameter type. This PR adds back the previous parameter types, and the fix has been verified to work with ORT training.

Error with the current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
  at::Tensor tensor = at::fromDLPack(dlpack);
site-packages/torch/include/ATen/DLConvertor.h:15:46: note: initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
  TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: #122000
Approved by: https://github.com/malfet
(cherry picked from commit 765c3fc)
Cherry picking #122000: the cherry-pick PR is at #123271 and it is linked with issue foobar. Details for Dev Infra team: raised by workflow job.
Validated with 2.3.