Append output variables from functions to input dataset #103
Comments
Nice summary, thanks @eric-czech. I'd be instinctively inclined to prefer your "ex-situ" pattern above, but I'm not an experienced enough xarray user to have a real opinion. Something to consider: what if we have a common keyword arg to functions that controls whether the new variables are merged into the input dataset?
I like that. A way forward there that makes sense to me is: …
I would mildly recommend we try to get ahead of the fact that the merging can get complicated and consider something like a … EDIT: Actually, the more I think about using that, the more I think it makes sense to either set update=False or have the function take full control over the merging.
I'm with you until (3) here @eric-czech:
I would say "At the very end of each function merge the new variables in based on the flag and return them". There's no reason not to return the variables in the merged case as well.
I agree.
Yea sorry, that's what I meant. Either you get the original dataset with the new variables added to it or you get a dataset just containing the new variables.
Me too. A pipeline would then look like the sketch below.
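A minimal runnable sketch of that shape, using a toy stand-in for an sgkit function (the merge keyword is the proposed API here, not something that exists yet):

import numpy as np
import xarray as xr

# Toy stand-in for an sgkit function: it computes new variables and either
# merges them into the input (merge=True) or returns them on their own.
def count_alleles(ds, merge=True):
    new = xr.Dataset({"allele_count": ds["call_genotype"].sum(dim="samples")})
    return ds.merge(new) if merge else new

ds = xr.Dataset(
    {"call_genotype": (("variants", "samples"), np.random.randint(0, 2, (5, 3)))}
)

# With merge=True as the default, calls chain naturally via Dataset.pipe;
# further sgkit calls could be appended with additional .pipe(...) steps.
out = ds.pipe(count_alleles, merge=True)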
I don't think this is right. The merged dataset is a new object, so the original dataset is unchanged. (They share backing data arrays, of course, but these should be viewed as immutable.) This is why I think there is a strong argument for making merging the default behavior. Perhaps it should be called …
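That behavior is easy to check with plain xarray:

import numpy as np
import xarray as xr

ds1 = xr.Dataset({"x": ("dim", np.arange(3))})
ds2 = ds1.merge(xr.Dataset({"y": ("dim", np.ones(3))}))

# merge returns a new Dataset and leaves the original unchanged...
assert "y" in ds2 and "y" not in ds1
# ...but unchanged variables still share their backing arrays, which is
# why those arrays should be treated as immutable.
assert np.shares_memory(ds1["x"].values, ds2["x"].values)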
Ah yea, agreed on that as long as we don't do …
Sounds good @tomwhite and @eric-czech - +1 to that. So, just so I'm totally clear here, the proposed pattern is:

ds2 = sgkit.some_function(ds1, merge=True)
ds3 = sgkit.some_function(ds1, merge=False)

We have: ds2 is ds1 plus the newly created variables, ds3 contains only the new variables, and ds1 is unchanged in both cases.

Do I have this right? If so, SGTM, and yes, …
Exactly.
Hail does it too and I agree it's one less thing for new users to learn. In the interest of moving this forward, it sounds like we're all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above), with the only difference being that we'll call the boolean flag merge.
What is being described here is very similar to the inplace argument used throughout pandas.
I think it's fair to assume that most users will have at least passing familiarity with pandas and hence replicating that behavior may be the most intuitive. Pandas defaults to creating a new dataframe with the changes applied to that copy (not the input arg).
Edit: Though I'm not exactly certain how pandas deals with shallow vs. deep copies when creating the new DataFrame.
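For reference, this is the pandas default being described (a new object comes back; the input is untouched unless inplace=True is passed):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The default returns a modified copy and leaves the input alone...
df2 = df.rename(columns={"a": "b"})
assert list(df.columns) == ["a"] and list(df2.columns) == ["b"]

# ...whereas inplace=True mutates the original and returns None.
df.rename(columns={"a": "b"}, inplace=True)
assert list(df.columns) == ["b"]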
Just to clarify, in @jeromekelleher's example the default behavior … But …
Thanks @timothymillar, this is very helpful input and great to know.
+1
One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created. Given that we want to standardize on a global naming convention, and that some functions will produce a large number of variables, I would be mildly opposed to that. One practical situation I can think of that could require customizing variable names is when trying to prefix or suffix them as part of some workflow that colocates the same variables from different sources. An example would be if I wanted a dataset containing both 1000 Genomes and UKB variables aligned on some coordinates or index. I think the ability to "align" datasets like that is actually a really killer feature for sgkit, and it's also something that's supported well over multiple Datasets by xr.align or xr.sel with indexes (so customizing names isn't really necessary). I can almost imagine some scenarios where something like a tf.name_scope would be helpful for controlling variable suffixes/prefixes, but it also seems fairly easy to work around with renaming (see the sketch below). Anyways, I think it would be good to know where we all stand on that before making changes.
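A rough sketch of that rename-based workaround with plain xarray; the variable names and toy datasets here are made up for illustration:

import numpy as np
import xarray as xr

kg = xr.Dataset({"allele_frequency": ("variants", np.random.rand(4))},
                coords={"variants": [10, 20, 30, 40]})
ukb = xr.Dataset({"allele_frequency": ("variants", np.random.rand(3))},
                 coords={"variants": [20, 30, 40]})

# Prefix variable names per source, then align on the shared index and
# combine into one dataset -- no per-function name arguments needed.
kg_p = kg.rename({"allele_frequency": "kg_allele_frequency"})
ukb_p = ukb.rename({"allele_frequency": "ukb_allele_frequency"})
combined = xr.merge(xr.align(kg_p, ukb_p, join="inner"))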
I don't really have an opinion. Let's build some stuff and think about making rules later, would be my take. |
Related to this: what happens if a value in the input dataset is re-calculated? Will this be common? If a user re-calculates …

The warnings can then be promoted to errors in 'production' pipelines to ensure that there is no variable clobbering.

A simple-minded approach to this would be to include the arguments …
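A minimal sketch of what that clobber detection could look like; the helper function and warning class are hypothetical, not an agreed design:

import warnings
import xarray as xr

class MergeWarning(UserWarning):
    # Hypothetical warning for merges that would overwrite existing variables.
    pass

def merge_outputs(ds, new, merge=True):
    clobbered = set(ds.data_vars) & set(new.data_vars)
    if clobbered:
        # 'Production' pipelines can escalate this to an error with
        # warnings.simplefilter("error", MergeWarning).
        warnings.warn(f"Overwriting existing variables: {sorted(clobbered)}",
                      MergeWarning)
    if not merge:
        return new
    # Drop the clobbered variables first so the newly computed values win.
    return ds.drop_vars(sorted(clobbered)).merge(new)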
Today I've been trying to create a PR for #43 and I'm in a little bit of a pickle. We currently have only a handful of computation functions, but already a bunch of inconsistencies: …

I would argue that a consistent API is something we should strive for (and we currently have only ~4 functions). I can see that there is already an argument made for the 1st issue here, and I fully support always returning …
TBH, what I'd personally like most as a user is: …
This would mean I can use whatever naming conventions I'd like for any variables in any context, and it would be ok if some of the input/output variables are conditional. This contradicts the idea of a global namespace for variables (like I mentioned in https://github.com/pystatgen/sgkit/issues/87#issuecomment-672231814) but it would be most flexible. I see it as the polar opposite to any kind of "schemification" that would enforce the global names.
I think if we go the schemification route, it may make sense to do away with any control the user has over variable names? The two extremes I see are …

I completely agree we should find something more consistent!
Thanks for raising the inconsistencies @ravwojdyla. I think where it is required to specify variable names, that should be clear from required parameters in the function (e.g. …). As a new user I would expect to be able to run a pipeline with all the defaults and have it just work. So that implies there is some kind of global namespace. I like Eric's idea of being able to override all input variables, but no output variables (the user is responsible for those). And +1 to making …
Can this be closed with the merge of #217?
Not every function will fit the
fn(ds: Dataset, ...) -> Dataset
signature, but for the large majority that do, we have so far been adopting a convention that only returns newly created variables. Another option would be to always try to write those variables into the input dataset. For lack of a better phrase I'll call the former "Ex situ" updates and the latter "In situ" updates. Here are some pros/cons of each:

Ex situ updates

- Having to write ds.merge(fn(ds)) is clunky if you wanted the variables merged in the first place.
- A pipeline combining several results would look like ds = xr.merge([ fn1(ds), fn2(ds) ])

In situ updates

- Chaining works directly, e.g. ds.pipe(count_alleles).pipe(allele_frequency)
My more opinionated view of this is that the ex situ updates are best in larger, more sophisticated workflows. I say that because I think those workflows will be very much dominated by Xarray code with calls to individual sgkit functions interspersed within it, so the need for chaining sgkit operations is not very high and the need to transform/interrogate results before merging them in is higher.
It would be great to find some way to get all of those advantages above though. Perhaps there is a way to do something like this more elegantly:
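One possibility, sketched with an invented dataset_function decorator (none of these names exist in sgkit; this is just to make the idea concrete):

import functools
import xarray as xr

def dataset_function(fn):
    # Wraps a function that returns only its new variables, merging them
    # into the input dataset unless the caller opts out with merge=False.
    @functools.wraps(fn)
    def wrapper(ds, *args, merge=True, **kwargs):
        new = fn(ds, *args, **kwargs)
        return ds.merge(new) if merge else new
    return wrapper

@dataset_function
def count_alleles(ds):
    # Toy body standing in for the real computation.
    return xr.Dataset({"allele_count": ds["call_genotype"].sum(dim="samples")})

# Usage (given a dataset ds with a call_genotype variable):
#   ds = ds.pipe(count_alleles)                # original vars + allele_count
#   new_only = count_alleles(ds, merge=False)  # just the new variables

Functions would then be written ex situ (returning only what they create) while callers still get in situ-style chaining by default.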
A potential issue with this is that functions may rely on variables from other functions (see https://github.com/pystatgen/sgkit/pull/102#issue-465519140) and add them when necessary, which would mean the set of output variables isn't always the same. I don't think this is actually an issue with Dask as long as the definition of the variables is fast. It would be fast for most functions, and Dask will remove any redundancies, so presumably we wouldn't necessarily have to return or add them (i.e. this affects both the in situ and ex situ strategies). Using numpy variables definitely wouldn't work that way, but I'm not sure that use case is as practical outside of testing anyhow.
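As a concrete example of why lazy definitions are cheap, a quick Dask sketch:

import dask.array as da
import xarray as xr

# Defining dask-backed variables only builds a task graph; nothing runs
# until .compute() is called, so "extra" definitions cost almost nothing.
gt = da.random.randint(0, 2, size=(1000, 50), chunks=(500, 50))
ds = xr.Dataset({"call_genotype": (("variants", "samples"), gt)})
ds = ds.assign(allele_count=ds["call_genotype"].sum(dim="samples"))

# Dask keys are deterministic for identical inputs, so the same variable
# defined twice collapses to one set of tasks when graphs are combined.
ds["allele_count"].compute()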
Some of @ravwojdyla's work to "schemify" function results would make it easy to know what variables a function produces (https://github.com/pystatgen/sgkit/issues/43#issuecomment-669478348). There could be some overlap here with how that works out too.