fix breaking changes for ONNX Runtime Training #122000
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122000
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 22aef51 with merge base dfc5e93:
FLAKY - The following job failed but was likely due to flakiness present on trunk: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable)
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@cyyever @Skylion007 any thoughts on this?
@ajindal1 LGTM, hope that tests pass.
Why do we need a duplicate implementation rather than doing a const_cast in the header? (But also mark them deprecated so that there will be no use of such functions in Torch.)
aten/src/ATen/DLConvertor.cpp
Outdated
auto deleter = [src](void* self) {
  if (src->deleter) {
    // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
    src->deleter(const_cast<DLManagedTensor*>(src));
  }
};
return fromDLPack(src, std::move(deleter));
Do not copy the code, just const_cast in a top-level wrapper.
Suggested change:
- auto deleter = [src](void* self) {
-   if (src->deleter) {
-     // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
-     src->deleter(const_cast<DLManagedTensor*>(src));
-   }
- };
- return fromDLPack(src, std::move(deleter));
+ // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
+ return fromDLPack(const_cast<DLManagedTensor*>(src));
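For context on why a const_cast shows up here at all: the DLPack deleter callback takes a non-const pointer, so a `const DLManagedTensor*` cannot be released without casting. Below is an abridged sketch of the ownership struct, shown only for reference; see dlpack.h for the authoritative definition.

```cpp
// Abridged sketch of the DLPack ownership struct (the real definition lives in
// dlpack.h). The deleter signature is what forces the const_cast in the
// forwarding overload suggested above.
typedef struct DLManagedTensor {
  DLTensor dl_tensor;                        // tensor payload: data, shape, dtype, device
  void* manager_ctx;                         // opaque context owned by the producer
  void (*deleter)(struct DLManagedTensor*);  // note: takes a *non-const* pointer
} DLManagedTensor;
```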
aten/src/ATen/DLConvertor.h
Outdated
@@ -13,8 +13,11 @@ namespace at {
 TORCH_API ScalarType toScalarType(const DLDataType& dtype);
 TORCH_API DLManagedTensor* toDLPack(const Tensor& src);
 TORCH_API Tensor fromDLPack(DLManagedTensor* src);
+TORCH_API Tensor fromDLPack(const DLManagedTensor* src);
As those functions are semantically incorrect, let's mark them as deprecated and implement them in the header as const_casts forwarding to the non-const variants.
Suggested change:
- TORCH_API Tensor fromDLPack(const DLManagedTensor* src);
+ C10_DEPRECATED_MESSAGE("Please migrate to a non-const variant")
+ inline Tensor fromDLPack(const DLManagedTensor* src) { return fromDLPack(const_cast<DLManagedTensor*>(src)); }
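Putting this deprecation together with the clang-tidy suppression requested later in the thread, the header would end up looking roughly as follows. This is a sketch that assumes the forwarding overload sits next to the existing declaration; it is not the verbatim DLConvertor.h.

```cpp
// Sketch of the relevant part of aten/src/ATen/DLConvertor.h after the change
// (paraphrased; the real header may differ in details).
#include <ATen/Tensor.h>
#include <ATen/dlpack.h>
#include <c10/util/Deprecated.h>  // provides C10_DEPRECATED_MESSAGE

namespace at {

TORCH_API Tensor fromDLPack(DLManagedTensor* src);

// Deprecated const overload kept for backward compatibility (e.g. ORT training);
// it simply forwards to the non-const variant.
C10_DEPRECATED_MESSAGE("Please migrate to a non-const variant")
inline Tensor fromDLPack(const DLManagedTensor* src) {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  return fromDLPack(const_cast<DLManagedTensor*>(src));
}

} // namespace at
```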
Thanks, I have modified just the header file as you suggested, which is a minimal change and makes sense to me. I have started a run on my end to verify it works. Will confirm by EOD.
@ajindal1 The parameter is const_casted in the function body, so it shouldn't be taken as const. Is it possible to patch aten_op_executor.cc to use the non-const version?
@cyyever I don't completely understand your comment, can you please add more information? I am also unable to locate the aten_op_executor.cc file you mentioned.
I guess this means to call the non-const variant.
@cyyever we have also added the const_cast change in the onnxruntime repository. However, for backward compatibility we will need this change here as well.
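For illustration, a caller-side workaround of the kind described here might look like the following. This is a hypothetical sketch: the helper name `TensorFromDLPack` and the surrounding structure are assumptions, not taken from the actual aten_op_executor.cc.

```cpp
#include <ATen/DLConvertor.h>

// Hypothetical helper: cast away const before handing the capsule to ATen so
// the non-const at::fromDLPack overload is selected and no deprecated path is hit.
at::Tensor TensorFromDLPack(const DLManagedTensor* dlpack) {
  // NOLINTNEXTLINE(cppcoreguidelines-pro-type-const-cast)
  return at::fromDLPack(const_cast<DLManagedTensor*>(dlpack));
}
```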
@ajindal1 Thank you!
@malfet does everything look good to you now?
@ajindal1 if CI is green, yes, but I guess you'll need to add the clang-tidy suppression comment on top of the change; otherwise LGTM.
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the PyTorch AutoLabel Bot wiki page. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following check: trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed: .github/workflows/trunk.yml / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbbot merge -if "I've asked to ignore that failure, don't I?"
@pytorchbot merge -f "I've asked to ignore that failure, don't I?"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` as a last resort. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick
❌ 🤖 pytorchbot command failed: the cherry-pick command is missing its required arguments. Try `@pytorchbot --help` for more info.
@pytorchbot cherry-pick --onto release/2.3 --fixes "foobar" -c regression
Fixes breaking changes for ONNX Runtime Training. PR #121102 introduced an incompatibility with ORT training because of a change in parameter type. This PR adds back the previous parameter types, and the fix has been verified to work with ORT training.

Error with the current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
  at::Tensor tensor = at::fromDLPack(dlpack);
site-packages/torch/include/ATen/DLConvertor.h:15:46: note: initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
  TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: #122000
Approved by: https://github.com/malfet
(cherry picked from commit 765c3fc)
Cherry picking #122000: the cherry-pick PR is at #123271 and it is linked with issue foobar. Details for Dev Infra team: raised by workflow job.
Validated with 2.3.