Add support for subsetting ModelSpec and ModelSpecs instances. #208

Merged: matthewwardrop merged 2 commits into main from add_model_spec_subset on Nov 28, 2024

@matthewwardrop
Owner

matthewwardrop commented Nov 19, 2024

This implements .subset for ModelSpec instances, along the lines of patsy's DesignInfo.subset. Because I like doing things over the top, I also implemented it for ModelSpecs instances, allowing one-step subsetting of nested structures.

e.g.

import numpy as np
import pandas as pd
from formulaic import model_matrix

n = 25
df = pd.DataFrame(
    {
        "y": np.random.standard_normal(n),
        "x": np.random.standard_normal(n),
        "z": np.random.standard_normal(n),
        "c": pd.Series(np.random.choice(["a", "b", "d"], size=n), dtype="category"),
    }
)

fmla = "y ~ 1 + x + c + z"
mm = model_matrix(fmla, df)

mm.rhs.model_spec.subset("x")

outputs:

# ModelSpec(formula=1 + x, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=x, scoped_terms=[x], columns=['x'])], transform_state={}, encoder_state={'x': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['a', 'b', 'd'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['a', 'b', 'd'])}), 'z': (<Kind.NUMERICAL: 'numerical'>, {})})

mm.model_spec.subset("y ~ x")

outputs:

.lhs:
    ModelSpec(formula=y, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=y, scoped_terms=[y], columns=['y'])], transform_state={}, encoder_state={'y': (<Kind.NUMERICAL: 'numerical'>, {})})
.rhs:
    ModelSpec(formula=1 + x, materializer='pandas', materializer_params={}, ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='pandas', cluster_by=<ClusterBy.NONE: 'none'>, structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=x, scoped_terms=[x], columns=['x'])], transform_state={}, encoder_state={'x': (<Kind.NUMERICAL: 'numerical'>, {}), 'c': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['a', 'b', 'd'], 'contrasts': ContrastsState(contrasts=TreatmentContrasts(base=UNSET), levels=['a', 'b', 'd'])}), 'z': (<Kind.NUMERICAL: 'numerical'>, {})})
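
To make the shape of these results concrete, here is a minimal pure-Python sketch of what subsetting does to the structure printed above. ToySpec, ToyTerm and toy_subset are hypothetical stand-ins, not formulaic's classes or its implementation. The sketch only mirrors what the output shows: the structure list is reduced to the requested terms, kept in their original order, while encoder_state is carried over unchanged.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyTerm:
    """Hypothetical stand-in for one entry of ModelSpec.structure."""

    term: str
    columns: tuple


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a ModelSpec; not formulaic's class."""

    structure: tuple
    encoder_state: dict = field(default_factory=dict)

    @property
    def column_names(self):
        # Flatten the per-term columns in term order.
        return [col for t in self.structure for col in t.columns]


def toy_subset(spec, terms):
    """Keep only the requested terms, preserving the original term order."""
    known = {t.term for t in spec.structure}
    missing = [t for t in terms if t not in known]
    if missing:
        raise ValueError(f"Terms not present in spec: {missing}")
    wanted = set(terms)
    kept = tuple(t for t in spec.structure if t.term in wanted)
    # encoder_state is left untouched, as in the output printed above.
    return replace(spec, structure=kept)


full = ToySpec(
    structure=(
        ToyTerm("1", ("Intercept",)),
        ToyTerm("x", ("x",)),
        ToyTerm("c", ("c[T.b]", "c[T.d]")),
        ToyTerm("z", ("z",)),
    ),
    encoder_state={"x": "numerical", "c": "categorical", "z": "numerical"},
)

# The request order ("x" before "1") does not affect the result order.
sub = toy_subset(full, ["x", "1"])
print(sub.column_names)
```

Returning a new object rather than mutating the original keeps the full spec reusable, which matters when the same spec is subset several ways.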

@bashtage Would be great to see if this works for you.

TODO:

  • Unit tests
  • Documentation

closes: #206

@bashtage
Contributor

Very helpful. I think between this and the parser changes we are very close. The only two obvious remaining issues are pickling, for which I have a PR in #209, and variable order, which still needs a look.

@matthewwardrop
Owner Author

@bashtage There is a bug here around materializing model matrices from the subset model spec; in particular, an exception is raised if new columns are added due to the full-rank algorithm. I'm also massaging this a bit more: respecting the term order in the resulting model spec, and adding a few more helper methods for completeness. Should land in a day or so.

@bashtage
Contributor

Just a little positive feedback - it looks like this addition works on statsmodels when I take this branch, merge in main, and merge in my order preservation branch.

@bashtage
Contributor

Ran into one issue using subset. If the original model has a spec like 1 + C(dummy) where, say, dummy=[0,1,0,1,0,1,0,1], and you remove the intercept (drop the term "1"), then you get an error like:

Term `C(dummy)` has generated too many columns compared to specification: generated ['C(dummy)[T.0.0]', 'C(dummy)[T.1.0]'], expecting ['C(dummy)[T.1.0]'].
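
The mismatch in this message can be reproduced in miniature. The following pure-Python sketch is an assumption about why the column counts differ, not formulaic's actual algorithm: with an intercept, a treatment-coded categorical term drops its reference level to stay full rank, whereas without one it needs a column for every level. A spec recorded under the first rule therefore expects fewer columns than are generated under the second. encode_columns and check_against_spec are hypothetical helpers.

```python
def encode_columns(name, levels, has_intercept):
    """Column names for a treatment-coded categorical term (toy version)."""
    levels = sorted(levels)
    # With an intercept the first level is absorbed as the reference level;
    # without one, every level needs its own indicator column.
    kept = levels[1:] if has_intercept else levels
    return [f"{name}[T.{lvl}]" for lvl in kept]


def check_against_spec(term, generated, expected):
    """Raise if the generated columns do not match the recorded ones."""
    if generated != expected:
        raise ValueError(
            f"Term `{term}` generated {generated}, expecting {expected}."
        )


# Floats are used so the level labels match the ones in the message above.
dummy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Columns recorded when the spec was built from `1 + C(dummy)`.
recorded = encode_columns("C(dummy)", set(dummy), has_intercept=True)

# Columns generated once the intercept term has been dropped.
regenerated = encode_columns("C(dummy)", set(dummy), has_intercept=False)

try:
    check_against_spec("C(dummy)", regenerated, recorded)
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)

print(recorded)
print(regenerated)
print(mismatch)
```

Under this reading, a subset that removes the intercept cannot simply reuse the recorded column list for the remaining categorical terms.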

@matthewwardrop
Owner Author

Yeah... That's the issue I found above. Should have this one fixed pretty soon now that I know how I am going to deal with the . operator.

@matthewwardrop matthewwardrop merged commit cb88cc2 into main Nov 28, 2024
@matthewwardrop matthewwardrop deleted the add_model_spec_subset branch November 28, 2024 05:18
@matthewwardrop
Owner Author

@bashtage This should all work nicely now in main. Let me know if it doesn't!

Development

Successfully merging this pull request may close these issues.

Migrating patsy's DesignInto.subset to ModelSpec.?