
Add dynamic shape support to sigmoidbackward #4322


Merged
merged 11 commits into master from testBackwardPassModelDSTake2 on Jan 11, 2023

Conversation

vanbasten23
Collaborator

I messed up a previous branch, so I'll just copy a few comments here.

@vanbasten23
Collaborator Author

Succeeds when we use dynamic test data but static training data: f82efbc. It outputs:

Finished training. Got loss: 0.686253547668457
Finished testing, loss= 0.6257358193397522

@vanbasten23
Collaborator Author

Once I made the training data dynamic as well (0048f3b), the test failed with this error:

root@t1v-n-2a2b95ef-w-0:/workspaces/work# python3 pytorch/xla/test/test_dynamic_shape_models.py
Traceback (most recent call last):
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 78, in <module>
    train(model, loss_fn=criterion, optimizer=optimizer)
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 65, in train
    loss.backward()
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 484, in backward
    torch.autograd.backward(
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1]

which I'm going to look into.

@vanbasten23
Collaborator Author

From milad:
Looks like sigmoid_backward needs to support dynamism. Wdyt?

@vanbasten23
Collaborator Author

Yeah. Also, for the error RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1], it failed at https://github.com/pytorch/pytorch/blob/912a1f7b2776c0e7ebf9038e4483a4aa709aa893/torch/csrc/autograd/engine.cpp#L812. Stack trace: https://gist.github.com/vanbasten23/a68180922e9f4c554b92365c961c21a4

@vanbasten23
Collaborator Author

Milad: We have the stack trace pointing to the pytorch/torch/csrc/autograd/engine.cpp error. See the previous comment. @wconstab @ezyang we are wondering if the autograd engine is missing dynamism support. Wdyt?

@vanbasten23
Collaborator Author

Ed: This error is not one I've seen before. At a guess, XLA's bounded symints don't correctly implement the equality/comparison operators and a test the AD engine is doing is failing. If you log symint ops it should become clear.

@wconstab
Collaborator

so, the way i would debug this is to stick some print statements into this code in engine.cpp

you're trying to figure out why

    if (!metadata.is_same_shape(grad)) {
      if (metadata.is_expandable_to_shape(grad)) {

are failing, and it'd be good to understand both (a) how is_same_shape works when it gets sym vs non-sym shapes input, and (b) there are shapes stored both in 'metadata' and 'grad' which are expected to match: can you reason about whether one or both is correct, or if one is missing symint?

finally, if the 2 copies of the shapes look right, there should be some calls to symint::operator(something) made by is_same_shape - maybe the XLA version of symint is missing support or is incorrect for one of those calls (I think that's what Ed was suggesting)
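
For reference, here is a minimal sketch of the kind of print debugging suggested above, reusing only members already quoted in this thread; it is purely illustrative and not part of any merged change:

// Illustrative only: log both copies of the shape before the check decides.
// incompatible_shape_error_message already formats the expected and actual
// shapes, so emitting it unconditionally shows exactly what is compared.
if (!metadata.is_same_shape(grad)) {
  TORCH_WARN("shape mismatch details: ",
             metadata.incompatible_shape_error_message(i, grad).str());
  if (metadata.is_expandable_to_shape(grad)) {
    grad = metadata.reduce_grad(grad);
  } else {
    const auto message = metadata.incompatible_shape_error_message(i, grad);
    AT_ERROR(format_error(message.str()));
  }
}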

@vanbasten23
Collaborator Author

vanbasten23 commented Dec 14, 2022

I did some digging and also added print statements in XLASymNode's virtual methods, following Will and Ed's suggestions.

  1. First, I think the error message I got (got [80, 1] but expected shape compatible with [<=80, 1]) may be incorrect; it should be the other way around: got [<=80, 1] but expected shape compatible with [80, 1].

It's because when we check the shape compatibility https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/torch/csrc/autograd/input_metadata.h#L109, grad.sym_size() is the desired size. But when we construct the error message https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/torch/csrc/autograd/input_metadata.h#L128, it treats grad.sym_size() as the actual size.

[edit]: please ignore this point as grad should be the output that we produced (here)

  2. IIUC, the real issue happens in is_expandable_to https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/aten/src/ATen/ExpandUtils.h#L496-L512. With some print statements I can tell shape is [<=80, 1] and desired is [80, 1]. So when it checks the equality of the 0th element https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/aten/src/ATen/ExpandUtils.h#L507, <=80 does not equal 80. Coming to PyTorch/XLA, when we compare two SymNodeImpls, XLASymNodeImpl does implement the eq method:

    c10::SymNode XLASymNodeImpl::eq(const c10::SymNode& other) {

Per my understanding, size != target is true as long as they don't have the same dynamic value.

So to fix it, when we check expandability https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/aten/src/ATen/ExpandUtils.h#L507, should we do size > target && size != 1 instead? WDYT? @wconstab @ezyang
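
To make the failure mode concrete, here is a small self-contained sketch (hypothetical types, not the real PyTorch/XLA code) of why the check fails for a [<=80, 1] shape whose runtime value is [79, 1] against a static [80, 1]:

// Hypothetical illustration; Dim and its values are made up for this sketch.
#include <cstddef>
#include <iostream>
#include <vector>

struct Dim {
  long upper_bound;    // e.g. 80 for "<=80"
  long dynamic_value;  // concrete runtime value, e.g. 79
};

// XLA-style equality: compare the concrete dynamic values.
bool operator==(const Dim& a, const Dim& b) {
  return a.dynamic_value == b.dynamic_value;
}
bool operator!=(const Dim& a, const Dim& b) { return !(a == b); }

// Simplified broadcast rule: each trailing dimension must match or be 1.
bool is_expandable_to(const std::vector<Dim>& shape,
                      const std::vector<Dim>& desired) {
  if (shape.size() > desired.size()) return false;
  for (std::size_t i = 0; i < shape.size(); ++i) {
    const Dim& size = shape[shape.size() - i - 1];
    const Dim& target = desired[desired.size() - i - 1];
    if (size != target && size.dynamic_value != 1) return false;
  }
  return true;
}

int main() {
  std::vector<Dim> metadata = {{80, 79}, {1, 1}};  // [<=80, 1], runtime [79, 1]
  std::vector<Dim> grad = {{80, 80}, {1, 1}};      // static [80, 1]
  // Prints "false": 79 != 80, which is why the autograd engine raises the
  // "invalid gradient" error quoted above.
  std::cout << std::boolalpha << is_expandable_to(metadata, grad) << "\n";
  return 0;
}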

@ezyang
Collaborator

ezyang commented Dec 14, 2022

I agree with the diagnosis but not the proposed fix. Can we introduce a new operator which denotes the semantic test we want to do here? What exactly is the test we are doing?

@miladm
Collaborator

miladm commented Dec 15, 2022

Here we hit an ne comparison operator. I don't see an implementation for XLASymNodeImpl::ne in dynamic_ir. At the same time, if ne were the source of the failure, I would've expected an error message from this code block. Regardless, it's worth implementing ne in XLA as a stepping stone.

@ezyang let me know if you have a different idea in mind when referring to "operator" above.
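
For reference, a hypothetical sketch of what an XLASymNodeImpl::ne could look like, mirroring the eq signature quoted earlier; SizeNe is assumed here to be a dynamic IR node analogous to the one eq uses, and the details may differ from what #4338 actually implements:

// Hypothetical sketch only; the real implementation in #4338 may differ.
c10::SymNode XLASymNodeImpl::ne(const c10::SymNode& other) {
  // Mirror eq: build a dynamic IR comparison node over the two size nodes
  // and wrap it back into a SymNode.
  auto* other_xla = dynamic_cast<XLASymNodeImpl*>(other.get());
  XLA_CHECK(other_xla != nullptr) << "expected an XLA SymNode";
  torch::lazy::NodePtr n_ne =
      torch::lazy::MakeNode<SizeNe>(node(), other_xla->node());  // SizeNe assumed
  return c10::make_intrusive<XLASymNodeImpl>(n_ne);
}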

@miladm
Collaborator

miladm commented Dec 15, 2022

Pushed this PR for ne: #4338. Waiting for CI tests to proceed.
(master head was red. It must now be unblocked.)

@miladm miladm added the dynamism Dynamic Shape Features label Dec 15, 2022
@miladm miladm self-requested a review December 15, 2022 16:24
@miladm
Collaborator

miladm commented Dec 15, 2022

On a slightly tangential topic: DS (dynamic shapes) breaks the graph at conditional operations. The expectation is not to have conditional ops on the critical path of the compiler (FWD+BKWD). What are your thoughts on finding a solution to avoid running into such scenarios in the compiler?

@wconstab @ezyang

@wconstab
Collaborator

DS breaks the graph at conditional operations

this is probably worth breaking out into another issue and getting more specific about the type of conditional and the type of symint. Your options also depend on whether you're factoring in Dynamo or not.

The expectation is not to have conditional ops on the critical path of the compiler (FWD+BKWD)

Not sure what you mean here. Are you expecting to avoid these operators in model code, or find a solution that avoids graph-breaking when they are encountered?

@miladm
Collaborator

miladm commented Dec 15, 2022

Yup, happy to create a separate issue to carry the discussion.

this is probably worth breaking out into another issue and getting more specific about the type of conditional and the type of symint. Your options also depend on whether you're factoring in Dynamo or not.

  • IIUC, dynamo may reduce the number of conditionals on a given code path. Though, when any conditional is passed downstream (whether from Dynamo or not), we need to trigger a graph break. Correct me if I am wrong.
  • Can you elaborate on how you think the 'type' of conditional plays a role in deciding whether or not a graph break occurs?

Not sure what you mean here. Are you expecting to avoid these operators in model code, or find a solution that avoids graph-breaking when they are encountered?

  • re: model code, the user is responsible for avoiding conditionals in their implementation. The fewer conditionals they introduce, the better their code will perform.
  • I am looking to understand if there is a way for us to avoid having PT source code cause graph breaks (upon conditionals) on commonly executed ops.

@miladm
Collaborator

miladm commented Dec 15, 2022

Update: the ne PR is good to have, but doesn't address this issue (reference).

I'd like to understand why metadata is not dynamic. @vanbasten23 can you please take a look?

@wconstab
Collaborator

Can you elaborate how you think the 'type' of conditional plays a role in deciding whether or not a graph break occurs?

I should have said, the inputs to the conditional play a role... -> if the conditional can be evaluated statically, based on information from faketensors or based on constants, we can keep tracing and then build the assumptions into our hash (lazy) or guards (dynamo).

I am looking to understand if there is a way for us to avoid having PT source code cause a graph breaks (upon conditionals) on commonly executed ops.

Yea, we're making changes to PT source all the time in the symbolic-shapes workstream, trying to make it friendlier to tracing. Of course, for certain ops like nonzero there isn't much we can do... But if there is a particular change you want to propose, let us know.

@vanbasten23
Collaborator Author

vanbasten23 commented Dec 15, 2022

For the original issue, I'm confused: at https://github.com/pytorch/pytorch/blob/fdc973308bcac5ff3e1c7d91c6d85e5328011653/torch/csrc/autograd/engine.cpp#L807-L814

if (!metadata.is_same_shape(grad)) {
  if (metadata.is_expandable_to_shape(grad)) {
    grad = metadata.reduce_grad(grad);
  } else {
    const auto message = metadata.incompatible_shape_error_message(i, grad);
    AT_ERROR(format_error(message.str()));
  }
}

the metadata has shape [<=80, 1] and grad has shape [80, 1]; here the code says we cannot expand a [<=80, 1] tensor to a [80, 1] tensor. With some print statements, specifically

file=/workspaces/work/pytorch/aten/src/ATen/ExpandUtils.h, line=511function=is_expandable_to: i=1, size=<=80, target=80
file=torch_xla/csrc/tensor.cpp, line=665function=eq: 
file=torch_xla/csrc/tensor.cpp, line=757function=bool_: 
xfile=torch_xla/csrc/ops/dynamic_ir.cpp, line=113function=getDynamicValue: dim_node_0->getDynamicValue()=79, dim_node_1->getDynamicValue()=80

I was able to find that the dynamic size of [<=80, 1] is actually [79, 1], so the error seems expected to me because we shouldn't expand a [79, 1] tensor to shape [80, 1].

So my question is: should the grad here have a dynamic size instead of a static size? @ezyang @wconstab

@ezyang
Collaborator

ezyang commented Dec 16, 2022

your link 'here' is broken

@ezyang
Collaborator

ezyang commented Dec 16, 2022

I think it's reasonable for grad to be dynamic size, though you kind of have a problem which is that <=79 shouldn't necessarily compare equal to <=79; e.g., 68 and 70 would be valid concretizations of these types but they're not equal.

The way we get around this in core is we maintain precise shape variables so we can tell that "s0" is the same as "s0". IDK if that'll work for XLA though

@vanbasten23
Collaborator Author

@ezyang
Collaborator

ezyang commented Dec 16, 2022

it's not really clear what it would mean for grad to be dynamic, it's coming from the downstream gradient calculation

@vanbasten23
Collaborator Author

I think it's reasonable for grad to be dynamic size, though you kind of have a problem which is that <=79 shouldn't necessarily compare equal to <=79; e.g., 68 and 70 would be valid concretizations of these types but they're not equal.

The way we get around this in core is we maintain precise shape variables so we can tell that "s0" is the same as "s0". IDK if that'll work for XLA though

Currently, the way that XLA compares two <=79 sizes for equality is by checking the real (dynamic) size here. In the current case, if grad happened to be [<=80, 1] with a dynamic size of 79, then metadata (which has shape [<=80, 1] with dynamic size [79, 1]) should be expandable to it. So I wonder if dynamism propagates correctly to grad. IOW, I wonder if grad should be SymInts [<=80, 1] instead of static shape [80, 1].

@ezyang
Collaborator

ezyang commented Dec 20, 2022

OK, then yes, it sounds like you need to propagate dynamism further

@vanbasten23
Collaborator Author

To confirm, here

if (!metadata.is_same_shape(grad)) {

grad is the output, which has size [80, 1]. metadata (i.e., the input metadata, which has size [<=80, 1]) is the expectation against which we are validating the output. Do you mean the expectation ([<=80, 1]) is correct and we need to propagate the dynamism to the output grad so that its size becomes [<=80, 1]?

@JackCaoG
Collaborator

SigmoidBackward is also a special case where we don't have a real lowering but reuse a bunch of IR-level computation. I wonder if that has any implication for the output shape losing its dynamism (if that's the case). Anyway, print everything here and you should have some ideas.

@vanbasten23
Collaborator Author

vanbasten23 commented Dec 21, 2022

Yeah, I'm already looking at how the output shape is generated.

XLATensorPtr sigmoid_backward(const XLATensorPtr& grad_output,
                              const XLATensorPtr& output) {
  return grad_output->CreateFrom(
      SigmoidBackward(grad_output->GetIrValue(), output->GetIrValue()));
}
grad_output->shape() and output->shape() gave me f32[<=80,1]{1,0} and f32[<=80,1]{1,0} respectively. So that means the inputs are dynamic.

Digging deeper, my current hypothesis is that return grad_output * (ScalarOp(1, GetXlaShape(output)) - output) * output would invoke https://github.com/pytorch/pytorch/blob/8b617f813d86c348be368a72170ab0d319308b23/torch/csrc/lazy/core/ops/arithmetic_ir_ops.cpp#L31-L36 and maybe we need to make GetPromotedBinaryOpShape dynamic. But once I added some print statements in NodePtr operator*(const Value& node1, const Value& node2), I couldn't see them.

Edit: oh I found the correct operator* override location.

@vanbasten23
Collaborator Author

Ok, return grad_output * (ScalarOp(1, GetXlaShape(output)) - output) * output would invoke

torch::lazy::NodePtr operator*(const torch::lazy::Value& node1,
                               const torch::lazy::Value& node2) {
  auto lower_fn = [](const XlaNode& node,
                     LoweringContext* loctx) -> XlaOpVector {
    xla::XlaOp op0 = loctx->GetOutputOp(node.operand(0));
    xla::XlaOp op1 = loctx->GetOutputOp(node.operand(1));
    return node.ReturnOp(XlaHelpers::PromotedMul(op0, op1), loctx);
  };
  return GenericOp(torch::lazy::OpKind(at::aten::mul), {node1, node2},
                   XlaHelpers::GetPromotedBinaryOpShape(GetXlaShape(node1),
                                                        GetXlaShape(node2)),
                   std::move(lower_fn));
}

The XlaHelpers::GetPromotedBinaryOpShape call seems to determine the output shape. Looking inside
xla::Shape XlaHelpers::GetPromotedBinaryOpShape(const xla::Shape& shape1,
                                                const xla::Shape& shape2) {
  return xla::ShapeUtil::MakeShape(
      PromoteType(shape1.element_type(), shape2.element_type()),
      torch::lazy::GetPromotedShape(
          xla::util::ToVector<int64_t>(shape1.dimensions()),
          xla::util::ToVector<int64_t>(shape2.dimensions())));
}

it looks like we are treating both shape1 and shape2 as static shapes.
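
To illustrate the point, a short sketch (assuming xla::ShapeUtil::MakeShape behaves as in upstream XLA; treat it as an illustration rather than the fix):

// Building a shape from raw dimension values yields a fully static shape,
// which is how the promoted output loses its "<=80" dimension.
xla::Shape static_shape =
    xla::ShapeUtil::MakeShape(xla::F32, {80, 1});  // f32[80,1]

// The dynamic bit has to be carried explicitly for the result to stay
// f32[<=80,1] (overload with per-dimension dynamic flags assumed here).
xla::Shape dynamic_shape =
    xla::ShapeUtil::MakeShape(xla::F32, {80, 1},
                              /*dynamic_dimensions=*/{true, false});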

@JackCaoG
Collaborator

yea, I guess we need to make GetPromotedBinaryOpShape understand dynamism.

@JackCaoG
Collaborator

ok, let's do this: instead of using GetPromotedBinaryOpShape, let's do something similar to

torch::lazy::NodePtr Dot(const torch::lazy::Value& input,
                         const torch::lazy::Value& weight) {
  auto lower_fn = [](const XlaNode& node,
                     LoweringContext* loctx) -> XlaOpVector {
    xla::XlaOp xla_input = loctx->GetOutputOp(node.operand(0));
    xla::XlaOp xla_weight = loctx->GetOutputOp(node.operand(1));
    return node.ReturnOp(BuildDot(xla_input, xla_weight), loctx);
  };
  auto lower_for_shape_fn =
      [](absl::Span<const xla::XlaOp> operands) -> xla::XlaOp {
    return BuildDot(operands[0], operands[1]);
  };
  return GenericOp(torch::lazy::OpKind(at::aten::mm), {input, weight},
                   [&]() {
                     return InferOutputShape(
                         {GetXlaShape(input), GetXlaShape(weight)},
                         lower_for_shape_fn);
                   },
                   std::move(lower_fn));
}

which will reuse the lowering function to get the xla::Shape of the output
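
Applied to the operator* override quoted earlier, the suggestion would look roughly like this (a sketch assembled from the snippets in this thread; the merged change may differ in detail):

// Sketch: infer the output shape by running the same lowering on the
// operand shapes, so dynamic dimensions are preserved, instead of calling
// GetPromotedBinaryOpShape.
torch::lazy::NodePtr operator*(const torch::lazy::Value& node1,
                               const torch::lazy::Value& node2) {
  auto lower_fn = [](const XlaNode& node,
                     LoweringContext* loctx) -> XlaOpVector {
    xla::XlaOp op0 = loctx->GetOutputOp(node.operand(0));
    xla::XlaOp op1 = loctx->GetOutputOp(node.operand(1));
    return node.ReturnOp(XlaHelpers::PromotedMul(op0, op1), loctx);
  };
  auto shape_fn = [](absl::Span<const xla::XlaOp> operands) -> xla::XlaOp {
    // Reuse the same computation to build the output op, so the inferred
    // xla::Shape keeps any bounded dynamic dimensions of the operands.
    return XlaHelpers::PromotedMul(operands[0], operands[1]);
  };
  return GenericOp(torch::lazy::OpKind(at::aten::mul), {node1, node2},
                   [&]() {
                     return InferOutputShape({GetXlaShape(node1),
                                              GetXlaShape(node2)},
                                             shape_fn);
                   },
                   std::move(lower_fn));
}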

@vanbasten23
Collaborator Author

Oops, didn't see your last message. I modified GetPromotedBinaryOpShape as in this naive PyTorch PR and changed how PT/XLA uses it in torch_xla/csrc/helpers.cpp (I uploaded the change in this PR), and the script got past the failure.

Your suggestion looks cleaner. Let me try it. Thanks.

@vanbasten23
Collaborator Author

@JackCaoG I tried your approach and it worked: the error goes away and is replaced with a new error. But to confirm, your proposed fix works because InferOutputShape uses XlaHelpers::ShapeOfXlaOp(result) internally and XlaHelpers::ShapeOfXlaOp(result) will take dynamism into consideration, right?

@vanbasten23
Collaborator Author

Update:

Recap: for the original SigmoidBackward error RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1], I tried Jack's suggestion and I think it fixed the SigmoidBackward error; we are now hitting another transpose error.

What I am trying to do is create a PR that fixes the SigmoidBackward error but not the transpose error. I'll create another GitHub issue/PR for the transpose error later. Right now I'm having some trouble creating a failing test that triggers only the SigmoidBackward error and not the transpose error: the test should take a dynamic tensor, call SigmoidBackward, then check that the result is also dynamic.

I have two options: writing a Python test or a C++ test.

Python tests: I have two Python tests that could reproduce the SigmoidBackward error. The problem is that both tests reproduce not only the SigmoidBackward error but also the transpose error. Removing the torch.nn.Linear layer didn't help (test); it failed with RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

C++ test: By following the existing backward test in pytorch/xla, I came up with my own. But it failed with C++ exception with description "element 0 of tensors does not require grad and does not have a grad_fn". I've had no luck fixing it so far. If I remove the nonzero, I don't get the error.

@vanbasten23
Collaborator Author

To unblock myself, I think I can rely on the existing SigmoidBackward C++ test and merge the fix for SigmoidBackward. Then I'll focus on the next transpose error. After that, I'll add my Python test to provide more test coverage. Wdyt? @miladm @JackCaoG

@JackCaoG
Collaborator

JackCaoG commented Jan 7, 2023

The existing sigmoid test can make sure you don't introduce a regression to the static shape test. As we discussed offline, testing nonzero + sigmoid in a C++ test has some issues. I think it is OK to skip the unit test this time, but add a linear model test that would fail without this fix (after you fix the other linear model failures).

@vanbasten23 vanbasten23 requested a review from JackCaoG January 9, 2023 23:15
@vanbasten23 vanbasten23 marked this pull request as ready for review January 9, 2023 23:16
@vanbasten23 vanbasten23 force-pushed the testBackwardPassModelDSTake2 branch from 6865061 to 05b1070 Compare January 10, 2023 05:17
@vanbasten23 vanbasten23 force-pushed the testBackwardPassModelDSTake2 branch from ddc6f7f to bd01f2f Compare January 10, 2023 18:00
y_pred = model(x_test)
criterion(y_pred.squeeze(), y_test).item()
xm.mark_step()
print('Test passed.')
Collaborator

remove this line

Collaborator Author

sg. Do you mind if I remove it in the next PR?

Collaborator

it is ok

@JackCaoG JackCaoG left a comment
Collaborator

I would rename this PR to something along the lines of

Add dynamic shape support to sigmoidbackward

to be less confusing.

@vanbasten23 vanbasten23 changed the title Test backward pass nn model with dynamic input Add dynamic shape support to sigmoidbackward Jan 11, 2023
@vanbasten23 vanbasten23 merged commit f1f9080 into master Jan 11, 2023