Regression discontinuity example #308
Merged: 30 commits into pymc-devs:main, Apr 21, 2022

Conversation

drbenvincent (Contributor)

New example notebook! While it is nothing particularly clever in terms of the model, it shows an application to quasi-experimental designs, which are not covered by any other example notebook.

It is also interesting because it touches on causal inference as well as using PyMC to ask counterfactual questions.

I can see utility in adding a number of future notebooks expanding on issues of analysis of quasi-experimental designs and causal inference. But for now, this can potentially be the first 🙂

review-notebook-app (bot)

Check out this pull request on ReviewNB: see visual diffs and provide feedback on Jupyter notebooks.

drbenvincent (Author)

I need to fix the schematic figure. It has a weird blue background, probably something to do with the transparent background setting.

Benjamin T. Vincent and others added 2 commits April 11, 2022 11:05
drbenvincent requested a review from OriolAbril, April 11, 2022 12:50
review-notebook-app (bot) commented Apr 11, 2022

OriolAbril commented on 2022-04-11T21:34:57Z
----------------------------------------------------------------

Line #2.    plt.tight_layout()

Remove tight_layout. The arviz-darkgrid theme uses constrained_layout, which is more or less equivalent, and the two can't be used at the same time.
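As a minimal sketch of the conflict (the style call and the plot here are illustrative, not the notebook's code): the ArviZ style already turns on constrained_layout, so adding tight_layout fights it.

```python
import arviz as az
import matplotlib.pyplot as plt

az.style.use("arviz-darkgrid")  # this style enables constrained_layout

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
# plt.tight_layout()  # omit: it conflicts with the active constrained_layout
plt.show()
```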


drbenvincent commented on 2022-04-13T12:01:27Z
----------------------------------------------------------------

Done, in upcoming commit

# Regression discontinuity design analysis

:::{post} April, 2022
:tags: regression discontinuity, causal inference, quasi experimental design, counterfactuals
Member:

For the tags, maybe regression alone should also be added? I think it would be of interest to people browsing the regression tag.

Contributor Author:

will do

review-notebook-app (bot) commented Apr 11, 2022

canyon289 commented on 2022-04-11T23:49:03Z
----------------------------------------------------------------

Line #10.        .assign(treated=lambda x: x.x > threshold)

Nice use of pandas!
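For context, a minimal sketch of the kind of data-simulation cell being discussed; the sample size, distribution, and threshold value here are assumptions, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
threshold = 0.0  # assumed cutoff on the pre-test measure

# Pre-test measure x; treatment is assigned purely by the threshold rule,
# via the .assign() line under discussion.
df = pd.DataFrame({"x": rng.normal(size=100)}).assign(
    treated=lambda x: x.x > threshold
)
```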


drbenvincent commented on 2022-04-13T11:59:09Z
----------------------------------------------------------------

Thanks

review-notebook-app (bot) commented Apr 11, 2022

canyon289 commented on 2022-04-11T23:49:04Z
----------------------------------------------------------------

Line #7.        plt.legend()

Nit: stick with ax.legend(). Switching between the stateful and object-oriented APIs in matplotlib can cause issues, and since this is a tutorial document we should show best practice in all aspects.
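The difference between the two interfaces, as a minimal sketch with illustrative data:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="data")

# Stateful API: acts on whatever matplotlib currently considers the
# "current" axes, which may not be the axes you intended.
# plt.legend()

# Object-oriented API: explicit about which axes the legend belongs to.
ax.legend()
plt.show()
```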


drbenvincent commented on 2022-04-13T12:01:08Z
----------------------------------------------------------------

Done in upcoming commit

- change notebook tag
- remove plt.tight_layout()
- plt.legend() -> ax.legend()
drbenvincent (Author)

That's all the comments addressed so far. Thanks for taking a look @OriolAbril + @canyon289 🙏🏻

- all units were exposed to the treatment (orange shaded region).

```{code-cell} ipython3
:tags: [hide-input]
Member:

I think I would show this cell, but after a couple changes, see next comments

Contributor Author:

will do

plt.legend();
```

The blue shaded region (which is very narrow) shows the 95% credible region of the expected value of the post-test measurement for a range of possible pre-test measures. This is actually very interesting because it is an example of counterfactual inference. We did not observe any units that were untreated above the threshold. But assuming our model is a good description of reality, we can ask the counterfactual question of "what if a unit above the threshold was not treated?"
Member:

"shows the 95% credible region of the expected value of the post-test measurement for a range of possible pre-test measures"

This is not right: the plot is of mu, which is the mean of the measurements, not the actual measurements. Otherwise the model would be way off in calibration terms. When plotting the 95% region of the measurements, for the same x as the observations roughly 95% of the observations should fall inside that region. Note the emphasis, because this applies x-wise and might not hold for the whole plotted region with ArviZ defaults, because of the smoothing.

Contributor Author:

I thought this is what I was saying by "95% credible region of the expected value of the post-test measurement"

Maybe it would be clearer by just deleting "measurement"?

Member:

Riight, sorry about that, "expected value" is a synonym of first moment. I always forget this.

Removing "measurement" might help, yes, but I am not sure it will be enough. I find this confusing for two reasons.

First, we are defining a new random variable, E(measurement), and giving confidence intervals on that instead of on the actual measurements, which are the terms in which most people think. I am not sure using this new variable instead of y directly helps illustrate the point that comes later, which, if I understand correctly, is about raw values.

Second, "expected value" doesn't read like a technical term. I always get it wrong, and I think the same happens to many other non-native speakers: confusion between credible regions of a value and of its mean is a common question I get. Using "expectation" or "expectancy", for example, might help on that end.

# posterior prediction
with model:
    pm.set_data({"x": _x, "treated": _treated})
    ppc = pm.sample_posterior_predictive(idata, var_names=["mu", "y"])
Member:

Side note that I hope helps with the other comments. I try to distinguish between posterior predictive and pushforward posterior variables: pushforward posterior variables are deterministic transformations of the posterior parameters, while posterior predictive variables need an extra sampling step.

In this case, mu = x + delta * treated: once the posterior values are fixed, mu is also fixed. Here sample_posterior_predictive is being used as a convenience function to recalculate mu with the modified data, but even though sample_... is called, no sampling is involved. There is sampling in computing y, because values of y are draws from the normal distribution with mean mu and std sigma. Multiple calls to sample_posterior_predictive with different seeds will return different values of y but always the same values of mu.
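This distinction can be made concrete with a sketch. Assuming the posterior holds variables named delta and sigma and that _x and _treated are NumPy arrays (names taken from the surrounding code), mu can be recomputed for new data with no random number generation at all, while y requires a fresh draw:

```python
import numpy as np

# Pushforward: mu is a deterministic transform of the posterior draws of
# delta, so recomputing it for new data involves no sampling.
delta = idata.posterior["delta"].values[..., None]  # shape (chain, draw, 1)
sigma = idata.posterior["sigma"].values[..., None]
mu = _x + delta * _treated                          # shape (chain, draw, n_points)

# Posterior predictive: y adds observation noise, so it needs an extra
# sampling step; different seeds give different y, but always the same mu.
rng = np.random.default_rng()
y = rng.normal(loc=mu, scale=sigma)
```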

Contributor Author:

Understood. I have added a bit of text to clarify this. I would not want to change the code to manually calculate mu for new values of x and treated - I think the current code is very clear and convenient as you say. Hopefully this works for you?

Member:

No need to change the code here at all, this was mostly for context


# plotting
ax = plot_data(df)
_y = ppc.posterior_predictive.mu.mean(dim=["chain", "draw"])
Member:

these _y variables are not being used

Contributor Author:

Well spotted


:::{post} April, 2022
:tags: regression, causal inference, quasi experimental design, counterfactuals
:category: beginner
Member:

I would add the explanation type here. There is some code showing how to build this model in PyMC, but I think the main goal and content of the notebook is answering explanation-type questions like "what are regression discontinuities?", "when are they useful?", and "what are the differences between their results and regular regression results?"

Suggested change:
- :category: beginner
+ :category: beginner, explanation

Contributor Author:

done

drbenvincent (Author)

Thanks @OriolAbril. I think I've addressed all your points now.

drbenvincent requested a review from OriolAbril, April 17, 2022 15:37
review-notebook-app (bot) commented Apr 20, 2022

lucianopaz commented on 2022-04-20T08:27:43Z
----------------------------------------------------------------

Line #10.        .assign(treated=lambda x: x.x > threshold)

Use x.x < threshold instead. That way, the simulated data will look more like the schematic figure at the top of the notebook


drbenvincent commented on 2022-04-21T14:49:05Z
----------------------------------------------------------------

Done, in upcoming commit

review-notebook-app (bot) commented Apr 20, 2022

lucianopaz commented on 2022-04-20T08:27:43Z
----------------------------------------------------------------

I have two comments regarding the first note.

The first, minor comment is that you say "... post-test ($y$) measures where", but it should read "were" instead.

My second comment is that I find the note hard to understand. The pre-test x and post-test y measures are never the same, because $y_{i}$ is sampled from a normal distribution. What I think you meant is that, in general, you could have written $\mu_i = \alpha + \beta x_{i} + \Delta \, \mathrm{treated}_{i}$, but you chose to fix $\alpha=0$ and $\beta=1$. My suggestion is to rephrase the note: instead of saying that the x and y measures are the same, say that the expected value of y is taken to be the pre-test measure x plus some treatment effect.
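In model terms, a minimal PyMC sketch of what this describes, with $\alpha=0$ and $\beta=1$ fixed; the priors and data-container names here are assumptions, not the notebook's actual code:

```python
import pymc as pm

with pm.Model() as model:
    x = pm.Data("x", df.x.to_numpy())
    treated = pm.Data("treated", df.treated.to_numpy().astype(float))
    delta = pm.Normal("delta", mu=0, sigma=1)  # treatment effect (assumed prior)
    sigma = pm.HalfNormal("sigma", sigma=1)    # observation noise (assumed prior)
    # E[y] is the pre-test measure itself plus the treatment effect
    mu = pm.Deterministic("mu", x + delta * treated)
    pm.Normal("y", mu=mu, sigma=sigma, observed=df.y.to_numpy(), shape=x.shape)
```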


drbenvincent commented on 2022-04-21T14:56:09Z
----------------------------------------------------------------

Ah, I meant that the measures were the same, not the values. As in, we measured height pre-test and also measured height post-test. I'll make this clearer.

review-notebook-app (bot) commented Apr 20, 2022

lucianopaz commented on 2022-04-20T08:27:44Z
----------------------------------------------------------------

Line #2.        idata = pm.sample(random_seed=123)

Use the global RANDOM_SEED instead


review-notebook-app (bot) commented Apr 20, 2022

lucianopaz commented on 2022-04-20T08:27:45Z
----------------------------------------------------------------

Line #13.    az.plot_hdi(_x, ppc.posterior_predictive["mu"], color="C0", hdi_prob=0.95)

To keep in line with what the other reviewers said about not mixing plt and OOP interfaces, you should pass ax=ax to this call. It would also be nice to make the HDI plot have a label. I believe you can do that by passing backend_kwargs={"label": "$mu$ untreated"}


drbenvincent commented on 2022-04-21T15:20:41Z
----------------------------------------------------------------

Good idea. Turns out it's fill_kwargs.
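The resulting call, as a sketch of the fix (label text per the reviewer's suggestion):

```python
import arviz as az

# Stay in the object-oriented interface by passing ax=ax, and label the
# HDI band via fill_kwargs (not backend_kwargs, as it turned out).
az.plot_hdi(
    _x,
    ppc.posterior_predictive["mu"],
    color="C0",
    hdi_prob=0.95,
    ax=ax,
    fill_kwargs={"label": r"$\mu$ untreated"},
)
ax.legend()
```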

review-notebook-app (bot) commented Apr 20, 2022

lucianopaz commented on 2022-04-20T08:27:46Z
----------------------------------------------------------------

Line #26.    az.plot_hdi(_x, ppc.posterior_predictive["mu"], color="C1", hdi_prob=0.95)

The same as my previous comment: add the kwargs: ax=ax, backend_kwargs={"label": "$mu$ untreated"}


review-notebook-app (bot) commented Apr 20, 2022

lucianopaz commented on 2022-04-20T08:27:46Z
----------------------------------------------------------------

Very minor nitpick. You assumed that mu = x for the untreated subjects, so it makes sense that the HDI is the identity line and shows no dispersion. Do you want to mention it? I don't really think it's necessary, since you've already said this many times in the intro part of the notebook.


drbenvincent commented on 2022-04-21T15:42:03Z
----------------------------------------------------------------

Good point, I have made this clearer now

lucianopaz (Member) left a comment

@drbenvincent, very nice notebook! I left a few comments requesting changes. Nevertheless, I'll approve the PR so that you can merge once you've addressed them.


drbenvincent merged commit 0c66575 into pymc-devs:main, Apr 21, 2022
drbenvincent deleted the regression-discontinuity branch, April 21, 2022 16:54