Regression discontinuity example #308
Conversation
I need to fix the schematic figure. It has a weird blue background for some reason. Something to do with the transparent background.
Co-authored-by: Oriol Abril-Pla <[email protected]>
OriolAbril commented on 2022-04-11T21:34:57Z (Line #2, `plt.tight_layout()`): remove `tight_layout`. The arviz-darkgrid theme uses `constrained_layout`, which is more or less equivalent, and the two can't be used at the same time.

drbenvincent commented on 2022-04-13T12:01:27Z: Done, in upcoming commit
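As an aside, a minimal sketch of the point above, assuming only what the comment states (that the arviz-darkgrid style enables constrained layout); this is not code from the notebook:

```python
import arviz as az
import matplotlib.pyplot as plt

# arviz-darkgrid enables constrained layout via rcParams, so figures are
# laid out automatically; calling plt.tight_layout() on top of it conflicts.
az.style.use("arviz-darkgrid")
fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="demo")
ax.legend()  # no tight_layout() needed
```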
```
# Regression discontinuity design analysis

:::{post} April, 2022
:tags: regression discontinuity, causal inference, quasi experimental design, counterfactuals
```
For the tags, maybe `regression` alone should also be added? I think it would be interesting to people browsing the regression tag.
will do
canyon289 commented on 2022-04-11T23:49:03Z (Line #10, `.assign(treated=lambda x: x.x > threshold)`): Nice use of pandas!

drbenvincent commented on 2022-04-13T11:59:09Z: Thanks
canyon289 commented on 2022-04-11T23:49:04Z (Line #7, `plt.legend()`): Nit, stick with `ax.legend()`. Switching between the stateful and object-oriented API in mpl can cause issues, and given this is a tutorial document I feel we should show best practice in all aspects.

drbenvincent commented on 2022-04-13T12:01:08Z: Done in upcoming commit
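For illustration, a minimal sketch of the object-oriented matplotlib style recommended above (toy data, not the notebook's code):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label="data")
ax.legend()  # attached to this Axes; avoids mixing in pyplot's global state
plt.show()
```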
- change notebook tag
- remove plt.tight_layout()
- plt.legend() -> ax.legend()

That's all the comments addressed so far. Thanks for taking a look @OriolAbril + @canyon289 🙏🏻
- all units were exposed to the treatment (orange shaded region).

```{code-cell} ipython3
:tags: [hide-input]
```
I think I would show this cell, but after a couple changes, see next comments
will do
```
plt.legend();
```
The blue shaded region (which is very narrow) shows the 95% credible region of the expected value of the post-test measurement for a range of possible pre-test measures. This is actually very interesting because it is an example of counterfactual inference. We did not observe any units that were untreated above the threshold. But assuming our model is a good description of reality, we can ask the counterfactual question of "what if a unit above the threshold was not treated?"
> shows the 95% credible region of the expected value of the post-test measurement for a range of possible pre-test measures

This is not right, the plot is on `mu`, which is the mean of the measurements, not the actual measurements. Otherwise the model would be way off in calibration terms. When plotting the 95% region of the measurements for the *same* `x` as the observations, roughly 95% of the observations should fall inside that region. Note the emphasis, because this applies x-wise and might not be true for the whole plotted region with ArviZ defaults because of the smoothing.
I thought this is what I was saying by "95% credible region of the expected value of the post-test measurement"
Maybe it would be clearer by just deleting "measurement"?
Right, sorry about that, "expected value" is a synonym of first moment. I always forget this. Removing "measurement" might help, yes, but I am not sure it will be enough. I find this confusing for two reasons:

1. We are defining a new random variable, E(measurement), and giving confidence intervals on that, instead of doing it on the actual measurements, which are the terms in which most people think. I am not sure using this new variable instead of `y` directly helps illustrate the point that comes later, which if I understand correctly is about raw values.
2. "Expected value" doesn't seem like a technical term. I always get it wrong, and I think it happens to many other non-native speakers, as confusion between credible regions of a value and of its mean is a common question I get. Using "expectation" or "expectancy", for example, might help on that end.
```python
# posterior prediction
with model:
    pm.set_data({"x": _x, "treated": _treated})
    ppc = pm.sample_posterior_predictive(idata, var_names=["mu", "y"])
```
Side note I hope helps with the other comments. I try to distinguish between *posterior predictive* and *pushforward posterior* variables. Pushforward posterior variables are deterministic transformations of the posterior parameters; posterior predictive variables are those that need an extra sampling step.

In this case, `mu = x + delta*treated`: once the posterior values are fixed, `mu` is also fixed. Here `sample_posterior_predictive` is being used as a convenience function to recalculate `mu` with the modified data, but even if `sample_...` is called, there is no sampling involved. There *is* sampling in computing `y`, because values of `y` are draws from the normal distribution of mean `mu` and std `sigma`. Multiple calls to `sample_posterior_predictive` with different seeds will return different values of `y` but always the same values for `mu`.
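A minimal sketch of this distinction, assuming `delta` and `sigma` draws live in `idata.posterior` and that `x_new` / `treated_new` are hypothetical arrays of new data (illustration only, not the notebook's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pushforward: mu is a deterministic transform of each posterior draw,
# so it comes out identical on every call.
delta = idata.posterior["delta"].values.flatten()            # shape (n_draws,)
sigma = idata.posterior["sigma"].values.flatten()            # shape (n_draws,)
mu = x_new[None, :] + delta[:, None] * treated_new[None, :]

# Posterior predictive: y takes a fresh random draw around mu, so repeated
# calls with different seeds give different y but always the same mu.
y = rng.normal(mu, sigma[:, None])
```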
Understood. I have added a bit of text to clarify this. I would not want to change the code to manually calculate `mu` for new values of `x` and `treated` - I think the current code is very clear and convenient, as you say. Hopefully this works for you?
No need to change the code here at all, this was mostly for context
```python
# plotting
ax = plot_data(df)
_y = ppc.posterior_predictive.mu.mean(dim=["chain", "draw"])
```
these `_y` variables are not being used
Well spotted
```
:::{post} April, 2022
:tags: regression, causal inference, quasi experimental design, counterfactuals
:category: beginner
```
I would add the `explanation` type here. There is some code showing how to build this model in PyMC, but I think the main goal and content of the notebook is dedicated to answering explanation-type questions like "what are regression discontinuities?", "when are they useful?", and "what differences are there between its results and regular regression ones?"

Suggested change: `:category: beginner` → `:category: beginner, explanation`
done
Thanks @OriolAbril. I think I've addressed all your points now.
lucianopaz commented on 2022-04-20T08:27:43Z (Line #10, `.assign(treated=lambda x: x.x > threshold)`): Use […]

drbenvincent commented on 2022-04-21T14:49:05Z: Done, in upcoming commit

lucianopaz commented on 2022-04-20T08:27:43Z: I have two comments regarding the first note. The first minor comment is that you say "... post-test ( […]. My second comment is that I find the note hard to understand. The pre-test x and post-test y measures are never the same because […]

drbenvincent commented on 2022-04-21T14:56:09Z: Ah, I meant that the measures were the same, not the values. As in, we measured height pre-test and also measured height post-test. I'll make this clearer.

lucianopaz commented on 2022-04-20T08:27:44Z (Line #2, `idata = pm.sample(random_seed=123)`): Use the global […]

lucianopaz commented on 2022-04-20T08:27:45Z (Line #13, `az.plot_hdi(_x, ppc.posterior_predictive["mu"], color="C0", hdi_prob=0.95)`): To keep in line with what the other reviewers said about not mixing […]

drbenvincent commented on 2022-04-21T15:20:41Z: good idea. Turns out it's […]

lucianopaz commented on 2022-04-20T08:27:46Z (Line #26, `az.plot_hdi(_x, ppc.posterior_predictive["mu"], color="C1", hdi_prob=0.95)`): The same as my previous comment: add the kwargs: […]

lucianopaz commented on 2022-04-20T08:27:46Z: Very minor nitpick. You assumed that […]

drbenvincent commented on 2022-04-21T15:42:03Z: Good point, I have made this clearer now
@drbenvincent, very nice notebook! I left a few comments requesting changes. Nevertheless, I'll approve the PR so that you can merge once you've addressed them.
New example notebook! While there is nothing particularly clever in terms of the model, it shows an application to quasi-experimental designs, which are not covered by any other example notebook.
It is also interesting because it touches on causal inference as well as using PyMC to ask counterfactual questions.
I can see utility in adding a number of future notebooks expanding on the analysis of quasi-experimental designs and causal inference. But for now, this can potentially be the first 🙂