Add dynamic shape support to sigmoidbackward #4322
Conversation
Succeeds when we use dynamic test data but static training data: f82efbc. It outputs:
Once I made the test data dynamic (0048f3b), the test failed with an error (the RuntimeError quoted below), which I'm going to look into.
From Milad:
Yeah. Also, for the error RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1], it failed at https://github.com/pytorch/pytorch/blob/912a1f7b2776c0e7ebf9038e4483a4aa709aa893/torch/csrc/autograd/engine.cpp#L812. Stacktrace: https://gist.github.com/vanbasten23/a68180922e9f4c554b92365c961c21a4
Ed: This error is not one I've seen before. At a guess, XLA's bounded symints don't correctly implement the equality/comparison operators, and a test the AD engine is doing is failing. If you log the symint ops it should become clear.
So, the way I would debug this is to stick some print statements into this code in engine.cpp. You're trying to figure out why the checks are failing, and it'd be good to understand how each side of the comparison is produced. Finally, if the two copies of the shapes look right, there should be some calls to symint::operator(something) made by the comparison.
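A minimal sketch of the kind of logging being suggested; this is not PyTorch source, just a helper one could call from the shape-validation path in engine.cpp, and where you call it from (and the names of the local variables there) is up to you. c10::SymInt and c10::SymIntArrayRef are real types and SymInt has a stream operator.

```cpp
// Sketch only: print both sides of the failing "invalid gradient" comparison.
#include <iostream>
#include <c10/core/SymIntArrayRef.h>

void log_shape_pair(c10::SymIntArrayRef grad_sizes,
                    c10::SymIntArrayRef expected_sizes) {
  std::cerr << "gradient sizes:";
  for (const auto& s : grad_sizes) {
    std::cerr << " " << s;  // c10::SymInt provides operator<<
  }
  std::cerr << "\nexpected (metadata) sizes:";
  for (const auto& s : expected_sizes) {
    std::cerr << " " << s;
  }
  std::cerr << std::endl;
}
```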
I did some digging and also added print statements in XLASymNode's virtual methods, following Will's and Ed's suggestions.
It's because when we check the shape compatibility at https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/torch/csrc/autograd/input_metadata.h#L109 ... ([edit]: please ignore this point.)
So to fix it, when we check expandability at https://github.com/pytorch/pytorch/blob/cdf4a80cc111b210f9ab9448da5aeea2007a0171/aten/src/ATen/ExpandUtils.h#L507, should we do ...?
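For concreteness, a hypothetical sketch of the question being asked (not a proposed PyTorch patch, and no PyTorch types are used): today the check compares sizes with plain equality, so a bounded dynamic size with upper bound 80 but runtime value 79 never matches the static 80; one conceivable relaxation is to compare upper bounds. The struct and helper below are invented for illustration.

```cpp
// Hypothetical illustration only.
#include <cstdint>

struct BoundedSize {
  int64_t upper_bound;  // 80 for "<=80", the exact value for a static size
  bool is_dynamic;      // true for "<=80", false for a plain 80
};

bool dim_expandable(const BoundedSize& from, const BoundedSize& to) {
  if (!from.is_dynamic && !to.is_dynamic) {
    // Static case: the usual broadcasting rule.
    return from.upper_bound == 1 || from.upper_bound == to.upper_bound;
  }
  // Dynamic case (the open question): one option is to treat "<=N"
  // as compatible with "N" by comparing upper bounds.
  return from.upper_bound == to.upper_bound;
}
```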
I agree with the diagnosis but not the proposed fix. Can we introduce a new operator which denotes the semantic test we want to do here? What exactly is the test we are doing?
Here we face a ... @ezyang, let me know if you have a different idea in mind when referring to "operator" above.
Pushed this PR for ...
this is probably worth breaking out into another issue and getting more specific about the type of conditional and the type of symint. Your options also depend on whether you're factoring in Dynamo or not.
Not sure what you mean here. Are you expecting to avoid these operators in model code, or find a solution that avoids graph-breaking when they are encountered?
Yup, happy to create a separate issue to carry the discussion.
Update: the ... I'd like to understand why ...
I should have said, the inputs to the conditional play a role... -> if the conditional can be evaluated statically, based on information from fake tensors or based on constants, we can keep tracing and then build the assumptions into our hash (lazy) or guards (dynamo).
Yea, we're making changes to PT source all the time in the symbolic-shapes workstream, trying to make it friendlier to tracing. Of course, for certain ops like nonzero there isn't much we can do... But if there is a particular change you want to propose, let us know.
For the original issue, I'm confused: at https://github.com/pytorch/pytorch/blob/fdc973308bcac5ff3e1c7d91c6d85e5328011653/torch/csrc/autograd/engine.cpp#L807-L814, I was able to find that the dynamic size of [<=80, 1] is actually [79, 1], so the error seems expected to me, because we shouldn't expand a [79, 1] tensor to shape [80, 1]. So my question is: should the grad be of dynamic size as well?
your link 'here' is broken
I think it's reasonable for grad to be dynamic size, though you kind of have a problem, which is that ... The way we get around this in core is that we maintain precise shape variables, so we can tell that "s0" is the same as "s0". IDK if that'll work for XLA though.
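A toy sketch (not PyTorch code) of what "precise shape variables" buys you: if both sizes refer to the same symbol object, equality can be decided by identity even though the concrete runtime value is unknown, whereas two independent bounded sizes (two separate "<=80"s) cannot be proven equal.

```cpp
// Toy illustration only: a "size" is either a known integer or a shared symbol.
#include <cstdint>
#include <memory>
#include <optional>

struct Symbol {};  // stands in for "s0", "s1", ...

struct Size {
  std::optional<int64_t> static_value;  // set for static sizes
  std::shared_ptr<Symbol> symbol;       // set for symbolic sizes
};

bool provably_equal(const Size& a, const Size& b) {
  if (a.static_value && b.static_value) {
    return *a.static_value == *b.static_value;
  }
  // Same symbol object => same size, regardless of its runtime value.
  return a.symbol && a.symbol == b.symbol;
}
```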
Sorry, I fixed the link. The correct link is https://github.com/pytorch/pytorch/blob/fdc973308bcac5ff3e1c7d91c6d85e5328011653/torch/csrc/autograd/engine.cpp#L807
it's not really clear what it would mean for grad to be dynamic; it's coming from the downstream gradient calculation
Currently, the way that XLA compares two ... for equality ...
OK, then yes, it sounds like you need to propagate dynamism further.
To confirm, here ...
Yeah, I'm already looking at how the output shape is generated (xla/torch_xla/csrc/tensor_methods.cpp, lines 2204 to 2208 at 3edc88f).
grad_output->shape() and output->shape() gave me f32[<=80,1]{1,0} and f32[<=80,1]{1,0} respectively. So that means the inputs are dynamic.
Digging deeper, my current hypothesis is ... Edit: oh, I found the correct operator* override location.
OK, in xla/torch_xla/csrc/ops/arithmetic_ir_ops.cpp (lines 39 to 51 at fccbc6a), XlaHelpers::GetPromotedBinaryOpShape seems to determine the output shape. Looking inside xla/torch_xla/csrc/helpers.cpp (lines 481 to 488 at fccbc6a), it treats shape1 and shape2 as static shapes.
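A minimal sketch of the direction being discussed, assuming the promoted shape is currently computed with no dynamic dimensions set. The helper name and the equal-rank simplification are mine; xla::Shape::is_dynamic_dimension and set_dynamic_dimension are real XLA APIs.

```cpp
// Sketch only: copy dynamic-dimension flags from the two operand shapes
// onto an already-promoted (static) result shape. For simplicity this
// assumes both operands already have the same rank as the result; real
// broadcasting would need right-aligned dimension matching.
#include "tensorflow/compiler/xla/shape.h"  // include path is an assumption

xla::Shape PropagateDynamicDims(const xla::Shape& shape1,
                                const xla::Shape& shape2,
                                xla::Shape promoted) {
  for (int64_t i = 0; i < promoted.rank(); ++i) {
    bool dynamic = shape1.is_dynamic_dimension(i) ||
                   shape2.is_dynamic_dimension(i);
    promoted.set_dynamic_dimension(i, dynamic);
  }
  return promoted;
}
```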
Yea, I guess we need to make ...
OK, let's do this: instead of using ..., follow the pattern in xla/torch_xla/csrc/ops/ops.cpp (lines 259 to 277 at c4bc972), which will reuse the lower function to get the xla::Shape of the output.
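A sketch of the pattern being pointed at: derive the node's output xla::Shape by running the op's lowering on shape-only operands, so any dynamic ("<=N") dimensions of the inputs flow through to the output. The helper names (InferOutputShape, GetXlaShape, BuildSigmoidBackward), their argument order, and the include paths are assumptions to be checked against the referenced ops.cpp lines, not quotes from the code.

```cpp
// Sketch, not a verbatim copy of ops.cpp.
#include "absl/types/span.h"
#include "torch_xla/csrc/elementwise.h"            // BuildSigmoidBackward (assumed)
#include "torch_xla/csrc/ops/infer_output_shape.h" // InferOutputShape (assumed)

xla::Shape NodeOutputShape(const torch::lazy::Value& grad_output,
                           const torch::lazy::Value& output) {
  auto lower_for_shape_fn =
      [](absl::Span<const xla::XlaOp> operands) -> xla::XlaOp {
    // Mirrors the lowering math: grad_output * output * (1 - output).
    return BuildSigmoidBackward(operands[0], operands[1]);
  };
  return InferOutputShape({GetXlaShape(grad_output), GetXlaShape(output)},
                          lower_for_shape_fn);
}
```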
Oops, didn't see your last message. I modified ... Your suggestion looks cleaner. Let me try it. Thanks.
@JackCaoG I tried your approach and it worked: the error goes away and is replaced with a new error. But to confirm, your proposed fix works because ...?
Update / recap: for the original SigmoidBackward error, what I am trying to do is to create a PR that fixes the SigmoidBackward error but not the transpose error; I'll create another GitHub issue/PR for the transpose error later. Right now I'm having some trouble creating a failing test that hits only the SigmoidBackward error and not the transpose error: the test will take a dynamic tensor, call SigmoidBackward, then check if the result is also dynamic. I have two options: a Python test or a C++ test.
Python tests: I have two Python tests that can reproduce the SigmoidBackward error. The problem with these two tests is that both reproduce not only the SigmoidBackward error but also the transpose error. Removing the ...
C++ test: by following the existing backward tests in pytorch/xla, I came up with my own, but it failed with the error ...
To unblock myself, I think I can rely on the existing SigmoidBackward C++ test and merge the fix for SigmoidBackward. Then I'll focus on the transpose error next. After that, I'll add my Python test to provide more test coverage. Wdyt? @miladm @JackCaoG
The existing sigmoid test can make sure you don't introduce regressions to the static shape test. As we discussed offline, testing ...
y_pred = model(x_test)
criterion(y_pred.squeeze(), y_test).item()
xm.mark_step()
print('Test passed.') |
remove this line
sg. Do you mind if I remove it in the next PR?
it is ok
I would rename this PR to something along the lines of
Add dynamic shape support to sigmoidbackward
to be less confusing.
I messed up a previous branch, so I'll just copy a few of the comments here.