Monorepo vs. multi-repo #65
Comments
This came up during our weekly sync, so I wanted to document it: we could use extras_require to separate dependencies, and even provide a convenient way to install a complete sgkit, e.g.: |
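A minimal sketch of how extras_require could separate the IO dependencies. The extra names ("plink", "bgen", "vcf", "all") and the mapping of packages to extras are assumptions for illustration, not the project's actual configuration; the package names themselves are the ones mentioned elsewhere in this thread.

```python
# Hypothetical extras layout: the group names and their contents are
# illustrative assumptions, not sgkit's real setup.py.
EXTRAS = {
    "plink": ["pysnptools"],
    "bgen": ["bgen-reader"],
    "vcf": ["scikit-allel"],
}


def with_all(extras):
    """Return a copy of `extras` plus an "all" group that unions every group."""
    combined = sorted({dep for deps in extras.values() for dep in deps})
    return {**extras, "all": combined}


EXTRAS_REQUIRE = with_all(EXTRAS)

# In setup.py this dict would be passed to setuptools, e.g.:
#   setuptools.setup(name="sgkit", ..., extras_require=EXTRAS_REQUIRE)
# so that `pip install sgkit[plink]` installs only the PLINK-related
# dependencies and `pip install sgkit[all]` installs every group.
```

With a layout like this, a plain `pip install sgkit` would stay light, and each format's dependencies would only be pulled in on request.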
@jeromekelleher @alimanfoo @tomwhite @eric-czech @hammer any thoughts/objections/-1/+1 to merging IO repos into the main repo and using |
I'm a +1 to that as long as it's always easy to install the io-related packages in a separate environment from all the other sgkit packages (and not need to share common dependencies). Something like this seems reasonable to me:
I think making CI run for everything in a common environment will work for now, but I don't think that would be a good long term solution. How hard is it to split the build up like that? |
@eric-czech I believe (we should double check), install of any |
And to provide some further reasoning for this:
Regarding the 2nd point, to be clear: we could likely have a "core" package that brings in all packages, some of which would require extra dependencies, and the lack of those would trigger a
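One way the idea of a "core" package with optional dependencies might look in practice is an import guard that fails with an installation hint. This is only a sketch: the helper name `require_optional` is hypothetical, and raising an ImportError that points at an extra is an assumption about what the missing dependency should trigger.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or raise ImportError with an install hint.

    `extra` is the (hypothetical) extras group that provides the dependency.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install sgkit[{extra}]"
        ) from exc
```

An IO function could call such a helper at the point of use, so that importing the core package never fails just because a format-specific dependency is absent.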
|
There's a long past discussion on this starting around here btw: https://github.com/pystatgen/sgkit/issues/2#issuecomment-646488636. I definitely agree with @jeromekelleher here in that I plan on using our IO tools once for most analyses, or twice to export as VCF/PLINK, and that some of those tools won't ever work well in the same environment. On the other hand, packages like pysnptools align with our stack (at the moment) and I agree that there are a bunch of advantages to making them colocate in the same environment. I'm not sure which way to lean with other IO tools but I think it will be important to understand sooner than later how packages that can't share environments will be integrated. For some more context on the current IO packages, pysnptools (requirements) depends on bgen-reader (requirements) and cumulatively they have pretty much the same dependencies as us, plus a few things like the native bgen c library. At least those two would be manageable for now and arguably we should pick up the maintenance if the two current maintainers ever drop it. |
Thanks for moving this discussion forward, @ravwojdyla! It's a big decision so we will definitely not decide one way or the other without input from @jeromekelleher and @tomwhite when they return. And hopefully we will get an email response from @CarlKCarlK and @horta so that we can more closely coordinate with their work on PLINK and BGEN IO. |
Hi guys, I will respond to you later tonight. Sorry about that. Also, I'll soon have a week to work on those projects again! |
I would be onboard with moving sgkit-plink and sgkit-bgen into the sgkit repo, and using For VCF we currently depend on scikit-allel to convert to Zarr as an intermediate format (see https://github.com/pystatgen/sgkit/pull/40). In the future we may want a tool to go directly from VCF to sgkit format, and that might be complex enough to warrant a separate package (let's see). But the sgkit-specific code for plink and bgen has turned out to be pretty minimal, so I think it could happily live in sgkit. |
I do see the arguments for merging everything together @ravwojdyla, and I agree it is a burden on us to keep the repos separate. However, I do think it would be a fundamental mistake to merge everything together and to put the burden of working out dependencies on the user. It is a lot easier for a user to understand whether they need to install Suppose I'm a user who wants to convert and process my VCF data. I'm lazy, so I don't read all the documentation saying what packages I need to install for what and so I run Under the current scheme, there are two ways we could recommend users approach this:
The point is, it's much easier for inexperienced users to understand that they need So, I agree it's a burden on us to maintain n different sgkit import/export libraries, but this is much better than putting the burden of understanding the dependency network for n different formats on users. OK, it's a PITA now to sync up n different repos in terms of coding style and so on, but this is just the initial phase when we're getting things up and running. Once the basic import/export functions are done, I'd bet that we'll rarely touch the format repos. All of the real development will happen in the sgkit repo, and import/export repos will basically be in maintenance mode, updating every now and again to deal with upstream updates. |
@jeromekelleher good points, so maybe a good middle ground would be to not have |
Yeah, this is a good idea @ravwojdyla - but I still don't think we should merge the repos! Fundamentally, I think it's short-sighted, but it's probably best to talk this through during a call. If everyone else agrees that we should merge the repos then, I'll go with the flow (and only complain and grumble occasionally 😉). Fair enough? |
FWIW, here's the list of packages we install in a fresh venv for sgkit:
and this is what we have for sgkit-plink:
The first looks nice and light to me, and the second... doesn't. |
Greetings,
Hi! I’m the lead on the PySnpTools project. I’m late to the sgkit discussion and still not fully up-to-speed.
But let me tentatively say that I’d consider working on (and/or supporting) an effort to move PySnpTools’s Bed (etc.) reader/writer to a new minimal project. Then PySnpTools, sgkit-plink, etc. could take a dependency on it. (It’s possible sgkit-plink could be that minimal project.)
PySnpTools is a big project whose purpose is to allow one to write projects like FaST-LMM (https://github.com/fastlmm/FaST-LMM) so that they can run on any data format, any clustering scheme, any distributed file system, etc. I’m fine with moving some of its low-level file reading code into a more shareable project.
* Carl
|
Welcome @CarlKCarlK! I'm excited to see how we can best integrate your work -- thanks for being willing to collaborate on it. |
For our next meeting tomorrow, I thought maybe it would be good to document the But first I will describe the potential setup:
Now I will dive into the consequences, using {benefit, drawback, same} to mark each point:
This is not an exhaustive list, nevertheless I see more benefits in merging the repos (I might be missing something tho). Feel free to comment here, and let's also discuss this further during our weekly tomorrow. |
Thanks for the great summary @ravwojdyla, this is very helpful! |
I've changed the title of this issue since the conversation has mostly centered upon the question posed by @ravwojdyla in the final sentence of the original issue comment:
|
For posterity, we have decided to consolidate the IO repos into the main repo, and try the design from https://github.com/pystatgen/sgkit/issues/65#issuecomment-670049733 |
Closing this out now that I've archived the IO-specific repositories. |
We have a solid style-check setup in the main repo, but the IO repos have a somewhat older one, and it might be hard to keep them in sync; we also have repos in RS with similar setups. I wonder if we should explore some kind of centralisation of this style-check setup (akin to a common plugin in sbt or a root parent in Maven). The best I could find for pre-commit is this; we could give it a try. One downside I can see is that it might not work well with the mypy check (but that already doesn't work well even in a single repo; related: https://github.com/pystatgen/sgkit/issues/39).
An alternative solution to this problem would be to have a single repo with proper separation of modules to prevent dependency hell.