Skip to content

Fix dim lengths created from coords or ConstantData #5751 #5763

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 21, 2022

Conversation

LukeLB
Copy link
Contributor

@LukeLB LukeLB commented May 14, 2022

Closes #5751. Fixes dimensions created by add_coord by making dimension length a TensorConstant. I have also added a test for this using some code from #5181.

I'm setting as WIP as @canyon289 suggested this PR is required for other work to become unblocked and I'm having problems with some failing tests (which could just be due to me running things locally with an incorrect set-up?). So it's worth seeing if they fail on here and then look at fixes, which will be faster than me trying to figure things out on my own!

Depending on what your PR does, here are a few things you might want to address in the description:

  • what are the (breaking) changes that this PR makes?
  • important background, or details about the implementation
  • are the changes—especially new features—covered by tests and docstrings?
  • consider adding/updating relevant example notebooks
  • right before it's ready to merge, mention the PR in the RELEASE-NOTES.md

@canyon289
Copy link
Member

it's worth seeing if they fail on here and then look at fixes, which will be faster than me trying to figure things out on my own

Great strategy. Thanks for taking such a principled approach to development

It seems that there type being changed somewhere. Might be as simple as updating this test.
https://github.com/pymc-devs/pymc/runs/6435716258?check_suite_focus=true#step:7:80

@codecov
Copy link

codecov bot commented May 14, 2022

Codecov Report

Merging #5763 (e97ad4e) into main (00d4372) will increase coverage by 0.00%.
The diff coverage is 93.10%.

❗ Current head e97ad4e differs from pull request most recent head abad0a4. Consider uploading reports for the commit abad0a4 to get more accurate results

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #5763   +/-   ##
=======================================
  Coverage   89.27%   89.28%           
=======================================
  Files          74       74           
  Lines       13811    13822   +11     
=======================================
+ Hits        12330    12341   +11     
  Misses       1481     1481           
Impacted Files Coverage Δ
pymc/data.py 81.63% <87.50%> (-1.99%) ⬇️
pymc/model.py 86.78% <100.00%> (+0.40%) ⬆️
pymc/sampling.py 88.42% <100.00%> (+0.05%) ⬆️
pymc/step_methods/hmc/base_hmc.py 89.68% <0.00%> (-0.80%) ⬇️
pymc/step_methods/hmc/nuts.py 97.50% <0.00%> (+2.50%) ⬆️

@LukeLB
Copy link
Contributor Author

LukeLB commented May 15, 2022

Failing test looks to be down to sampling error?

test_sample_posterior_predictive_after_set_data_with_coord
E       AssertionError: 
E       Not equal to tolerance rtol=1e-07, atol=0.1
E       
E       Mismatched elements: 2 / 2 (100%)
E       Max absolute difference: 0.13669218
E       Max relative difference: 0.02331315
E        x: array([5, 6])
E        y: array([4.89599 , 5.863308]) 

And it's highlighted in an outstanding issue #5739.

@LukeLB
Copy link
Contributor Author

LukeLB commented May 16, 2022

This test is failing because y is not the correct shape, y.eval().shape == (1, 1). Both feature and group are TensorConstants so the addition to add_coord is doing as intented with respect to type changes. I'm not sure I understand why this is still happening.....

def test_add_coord_fixed_dims():
    """
    test that dims does not cause resizing of y. See #5181.
    """
    with pm.Model(coords={"feature": [1], "group": ["A", "B"]}):
        x = pm.Normal("x", 0, 1, dims="feature")
        y = pm.Normal("y", x[..., None], 1, dims=("feature", "group"))
    assert x.eval().shape == (1,)
    assert y.eval().shape == (1, 2)

@michaelosthege
Copy link
Member

Maybe do more asserts?

with pm.Model(coords={"feature": [1], "group": ["A", "B"]}):
    assert pmodel.dim_lengths["feature"].eval() == 1
    assert pmodel.dim_lengths["group"].eval() == 2

@LukeLB
Copy link
Contributor Author

LukeLB commented May 16, 2022

My thinking for that test was to make sure that size and dims can be passed at the same time and we get the expected shape out which will not have been possible before this type change in add_coord(). So maybe its worth having two tests, the one you've suggested, and the other we can use the case you highlight #5181 (comment) to test that size and dims can be passed at the same time. What do you think?

@michaelosthege
Copy link
Member

michaelosthege commented May 16, 2022

When size/shape and dims are passed at the same time, the size/shape takes priority and dims are just tracked for labeling the dimensions in the final xarray objects.

That's why I suggested to only pass dims, and actually it's a little concerning that the test doesn't pass.

Maybe call add_coord directly while passing fixed. For the variables depending on it there should be no difference between aesara.shared() or at.constant() dim lengths.

with pm.Model() as pmodel:
    pmodel.add_coord("feature", [1], fixed=True/False)
    pmodel.add_coord("group", ["A", "B"], fixed=True/False)
    assert pmodel.dim_lengths["feature"].eval() == 1
    assert pmodel.dim_lengths["group"].eval() == 2
    x = pm.Normal("x", 0, 1, dims="feature")
    y = pm.Normal("y", x[..., None], 1, dims=("feature", "group"))
assert x.eval().shape == (1,)
assert y.eval().shape == (1, 2)

@LukeLB
Copy link
Contributor Author

LukeLB commented May 16, 2022

When I try that all cases of fixed = True/False lead to y.eval().shape == (1, 1).

@michaelosthege
Copy link
Member

Oh, now I see why. The x[..., None] creates the RV as (1, 1) and because the dimensionality is as expected, there's no further resizing.

This works 👇 because then the RV is (1,) and repeated for the "group" dimension.

    # ...
    x = pm.Normal("x", 0, 1, dims="feature")
    y = pm.Normal("y", x, 1, dims=("group", "feature"))
assert x.eval().shape == (1,)
assert y.eval().shape == (2, 1)

@LukeLB
Copy link
Contributor Author

LukeLB commented May 16, 2022

I think I'm still misunderstanding something here. Wouldn't passing dims like that still be allowed prior to this PR? So I don't understand what it is testing with respect to this PR.

@michaelosthege
Copy link
Member

Yes, that's already allowed, but where this test comes in is in combination with the changes from #5761.

Actually, it's a little unfortunate that these changes are on separate branches.
Your test case should (latest after adding a pm.set_data) close the coverage gaps from that other PR.

I'm inclined to merge #5761 even with those coverage gaps (I'll review it really carefully right now).
After that you could just git rebase origin/main, add two lines doing pm.set_data and assert the ShapeError or ShapeWarning and then we should be good to go 🤔

michaelosthege added a commit that referenced this pull request May 16, 2022
* Support TensorConstant entries in `Model.dim_lengths`.
* Raise errors on attempts to resize dims with `TensorConstant` length.
* Raise errors on attempts to resize dims that are linked to non-shared variables.
* Warn about resizing dims that weren't initialized from a shared variable (supposedly via `add_coord(..., fixed=False)`, see #5763).
* Raise errors when attempting to resize a dim that had coord values without providing new coord values.

Closes #5760 by anticipating that not all symbolic dim lengths originate from RVs.

Co-authored-by: Michael Osthege <[email protected]>
@michaelosthege
Copy link
Member

@LukeLB the other PR was merged, so now you should be able to rebase and bring the coverage back up :)

@LukeLB
Copy link
Contributor Author

LukeLB commented May 16, 2022

@michaelosthege OK I think I understand. I'll rebase and commit your suggested changes.

Copy link
Member

@michaelosthege michaelosthege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test cases for the ShapeWarning and ValueError?

With match="" one can make sure that the exception is the correct one. That's important because there are two different ShapeErrors that one could run into..

We should aim to cover all code paths from the diff here https://github.com/pymc-devs/pymc/pull/5761/files

@ricardoV94 ricardoV94 changed the title [WIP] Fix dim lengths created from coords or ConstantData #5751 Fix dim lengths created from coords or ConstantData #5751 May 18, 2022
@ricardoV94 ricardoV94 added this to the v4.0.0 milestone May 18, 2022
Copy link
Member

@michaelosthege michaelosthege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't make the change suggestion, but the whole block containing the ShapeWarning can be removed..

Copy link
Member

@michaelosthege michaelosthege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @LukeLB !

This brings the coverage back up should lead to fewer shape problems long term.

Copy link
Member

@michaelosthege michaelosthege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah damn, I just realized that we forgot to actually use the new fixed kwarg.

I commented a change in one of the tests that should currently fail.
Let me know if you want me to take over - you put in a lot of effort already and I don't want to get demotivated by all this back and forth

@LukeLB
Copy link
Contributor Author

LukeLB commented May 18, 2022

ah damn, I just realized that we forgot to actually use the new fixed kwarg.

I commented a change in one of the tests that should currently fail. Let me know if you want me to take over - you put in a lot of effort already and I don't want to get demotivated by all this back and forth

Not at all! The longer it goes the more I learn :) Thanks for the offer though.

@LukeLB
Copy link
Contributor Author

LukeLB commented May 18, 2022

Why does the assert statement on line 758 fail? The ConstantData() function suggests that it returns a TensorConstant.

@michaelosthege
Copy link
Member

michaelosthege commented May 18, 2022

Why does the assert statement on line 759 fail? Its created from coord so should be a TensorConstant but is Subtensor{int64}.0.

The Subtensor{int64}.0 is because from pm.ConstantData([1,2,3], dims="foo") it's effectively doing this: dim_lengths["foo"] = at.constant([1,2,3]).shape[0].

So what needs to be done is a change in the logic behind pm.ConstantData (or rather pm.Data) to determine the constant data's shape non-symbolically first, and then call add_coord(..., fixed=True).

And I should add that pm.MutableData([1,2,3], dims="bla") becomes: dim_lengths["bla"] = aesara.shared([1,2,3]).shape[0].

@LukeLB
Copy link
Contributor Author

LukeLB commented May 19, 2022

OK so I think I have a solution which I have pushed but I'm running into a wall with a failing test which I can't seem to fix. Unfortunatley I won't be able to work on this till after the weekend so if there is a time pressure on this then it will be worth someone else picking it up.

@michaelosthege
Copy link
Member

Thanks @LukeLB, I will try to pick it up in the evening

@canyon289
Copy link
Member

@LukeLB this weekend is fine! You've made great progress on this and don't want to take away joy of getting it merged!

@michaelosthege
Copy link
Member

michaelosthege commented May 21, 2022

I rebased the branch and squashed the history into three logically separate commits:

  1. ✔ Removing the ShapeWarning in favor of raising earlier
  2. ✔ Adding optional coords to the pm.Data API, because we were taking coordinate values from the data variable, which does not work for >1 dimensions.
  3. ✔ Addition of the fixed kwarg (the actual goal of this PR)

I will force-push them one by one so we can confirm that the first two commits don't create new problems.

Needed to create resizable dims with coordinate values,
because coords passed to the model will become immutable dims.
See pymc-devs#5763.

Also, because `pm.Data` variables can be N-dimensional,
the coordinate values can't reliably be taken from the value.
@michaelosthege michaelosthege force-pushed the issue_#5751 branch 2 times, most recently from 2c37381 to 79fdfdd Compare May 21, 2022 14:16
@michaelosthege
Copy link
Member

@LukeLB @canyon289 @ricardoV94 please review and note the new changes I made in the pm.Data API

Copy link
Member

@ricardoV94 ricardoV94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, left a small question and a suggestion about using the same keyword internally (fixed vs mutable)

if not mutable:
# Use the dimension lengths from the before it was tensorified.
# These can still be tensors, but in many cases they are numeric.
xshape = np.shape(arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't matter much but x.type.shape should have the same info

This increases the safety of working with resizable dimensions.
Dims created via `pm.Model(coords=...)` coordinates are now immutable.
Only dimension lengths that are created anew from `pm.MutableData`,
`pm.Data(mutable=True)`, or the underlying `add_coord(mutable=True)`
will become shared variables.

Co-authored-by: Michael Osthege <[email protected]>
Copy link
Member

@michaelosthege michaelosthege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel 👍 about merging as long as the tests go ✔.

@canyon289
Copy link
Member

Me too!

@michaelosthege michaelosthege merged commit a95dd71 into pymc-devs:main May 21, 2022
michaelosthege added a commit that referenced this pull request May 21, 2022
Needed to create resizable dims with coordinate values,
because coords passed to the model will become immutable dims.
See #5763.

Also, because `pm.Data` variables can be N-dimensional,
the coordinate values can't reliably be taken from the value.
@canyon289
Copy link
Member

canyon289 commented May 21, 2022

Congrats @LukeLB! Thank you for leading this PR

@LukeLB
Copy link
Contributor Author

LukeLB commented May 22, 2022

@canyon289 @michaelosthege @ricardoV94 thanks all!! It's been a pleasure :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix dim lengths created from coords or ConstantData
4 participants