Skip to content

Put intervals.py into tskit #2636

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented Nov 10, 2022

Description

As noted in tskit-dev/tsinfer#753 (comment). Some reworking was needed to use the built-in tskit table formatting functions when outputting the RateMap to unicode or html. Once a stable tskit release with this is released, we can remove it from msprime and simply treat mprime.RateMap as an alias for tskit.RateMap

@hyanwong
Copy link
Member Author

hyanwong commented Nov 10, 2022

Note that some HapMap files omit the first column with the chromosome name (e.g. https://ftp.ncbi.nlm.nih.gov/hapmap/recombination/latest/rates/). I wonder if in read_hapmap we should take this opportunity to switch to taking the last-but-two and last columns, rather than the 1st and 3rd? This would allow us to read both HapMap format files, and mostly shouldn't change behaviour on the files that were read before, unless people have extra columns in their HapMap files (you would hope not). But I guess it is potentially breaking behaviour for unusually formatted HapMap files with extra columns.

@codecov
Copy link

codecov bot commented Nov 10, 2022

Codecov Report

Merging #2636 (bfdc04c) into main (71c699b) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2636      +/-   ##
==========================================
+ Coverage   93.90%   93.95%   +0.04%     
==========================================
  Files          27       28       +1     
  Lines       27655    27872     +217     
  Branches     1269     1302      +33     
==========================================
+ Hits        25969    26186     +217     
  Misses       1648     1648              
  Partials       38       38              
Flag Coverage Δ
c-tests 92.25% <ø> (ø)
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.47% <29.28%> (-0.77%) ⬇️
python-tests 98.97% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/intervals.py 100.00% <100.00%> (ø)
python/tskit/tables.py 98.94% <100.00%> (-0.03%) ⬇️
python/tskit/util.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 71c699b...bfdc04c. Read the comment docs.

@hyanwong
Copy link
Member Author

This will be needed for the more flexible tskit HMM matching algorithm.

@benjeffery
Copy link
Member

Hmm, codecov is increasingly flakey, have triggered a re-run.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's more uncovered here than I'd like. I know it will be covered by the msprime tests eventually, but it should be covered here.

@hyanwong
Copy link
Member Author

There's more uncovered here than I'd like. I know it will be covered by the msprime tests eventually, but it should be covered here.

Thanks for looking at this. I agree about the coverage (I was waiting a bit to see what CI picked up). I'll add tests later in the week and ping you when done @benjeffery.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 14, 2022

Oh - before I add more tests, just to check that my general approach in this PR is a sensible one, @benjeffery?

@benjeffery
Copy link
Member

Yeah, seems good to me!

@jeromekelleher
Copy link
Member

As we are relicensing this code from GPL to MIT, we need to look at the git history and make sure that everyone who has contributed agrees to this change. Looking at the file in the msprime code:

$ git shortlog -sen msprime/intervals.py
    14  Jerome Kelleher <[email protected]>
     3  Yan Wong <[email protected]>
     3  peter <[email protected]>
     1  Graham Gower <[email protected]>
     1  awohns <[email protected]>

So, we have five contributors, @jeromekelleher, @hyanwong, @petrelharp @grahamgower and @awohns.

Can each of these contributors please confirm that they wish to relicense this code from GPLv3 to MIT?

@jeromekelleher
Copy link
Member

I approve the relicensing of this code originally developed in msprime from GPLv3 to MIT.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 15, 2022

As we are relicensing this code from GPL to MIT, we need to look at the git history and make sure that everyone who has contributed agrees to this change.

Oh, well spotted.

Can each of these contributors please confirm that they wish to relicense this code from GPLv3 to MIT?

Yes, I am happy to relicense all of my contributions to this code (originally developed in msprime) from GPLv3 to MIT.

@awohns
Copy link
Member

awohns commented Nov 15, 2022

I also approve the relicensing of this code originally developed in msprime from GPLv3 to MIT.

@grahamgower
Copy link
Member

Me too. I approve the relicensing of this code originally developed in msprime from GPLv3 to MIT.

@hyanwong
Copy link
Member Author

Coverage looks OK now. Just waiting to the the remaining approval from @petrelharp

@hyanwong
Copy link
Member Author

I also took the liberty of modifying get_cumulative_mass(self, positions) so that if positions was not provided, we use the positions defined in the RateMap. I think this makes it clearer what the get_cumulative_mass function does.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 17, 2022

I notice that there's a "to do" in the deprecated msprime routines here:

    def physical_to_genetic(self, x):
        return self.map.get_cumulative_mass(x)

    def genetic_to_physical(self, genetic_x):
        if self.map.total_mass == 0:
            # If we have a zero recombination rate throughout then everything
            # except L maps to 0.
            return self.get_sequence_length() if genetic_x > 0 else 0
        if genetic_x == 0:
            return self.map.position[0]
        # TODO refactor this to this to use get_cumulative_mass() function / add the
        # corresponding high-level function to the rate map.
        index = np.searchsorted(self.map._cumulative_mass, genetic_x) - 1
        y = (
            self.map.position[index]
            + (genetic_x - self.map._cumulative_mass[index]) / self.map.rate[index]
        )
        return y

I guess we want to implement a RateMap equivalent of genetic_to_physical, just like get_cumulative_mass is the RateMap equivalent of physical_to_genetic. What would we call it? I would presume it would look like this:

class RateMap:
    ...
    def cumulative_mass_lookup(self, cumulative_mass):
        """
        Returns a set of genomic positions corresponding to the passed-in
        cumulative mass values. This is the inverse of :meth:`get_cumulative_mass`
        
        .. note:: If the RateMap is a "genetic map" measuring the
            rate of recombination along a chromosome, this function will return
            the chromosomal position (i.e. the physical distance) for the provided
            genetic map values.

        :param numpy.ndarray cumulative_mass: The values for which to return genomic
            positions.

        :return: An array of positions along the genome, the same length as
            ``cumulative_mass``
        :rtype: numpy.ndarray

        """
        if self.total_mass == 0:
            # If we have a zero rate throughout then anything above 0 maps to L.
            return self.sequence_length if cumulative_mass > 0 else 0
        if cumulative_mass == 0:
            return 0
        map_c_mass = self.get_cumulative_mass()
        i = np.searchsorted(map_c_mass, cumulative_mass) - 1
        positions = self.position[i] + (cumulative_mass - map_c_mass[i]) / self.rate[i]
        return x

@petrelharp
Copy link
Contributor

I approve the license change!

@jeromekelleher
Copy link
Member

I don't think we care about the genetic_to_physical that much any more, and it's certainly not something we need to get in for the initial code drop. We can open an issue to track as needed.

We want to keep the code provenance clear here, so let's do the minimum necessary to port across in this PR.

Can you update the commit message that copies in the intervals.py code to include the agreements from each of the contributors please @hyanwong? Ideally we'd have a commit that copies it in directly with no changes, so that the updates are obvious later if/when someone does a code review for license violations.

We want to be completely clear about licensing issues.

To match the msprime `text_table` function
Allows it to be used by other tabular outputting routines.

Also put the "limit" functionality into a single util function
hyanwong added a commit to hyanwong/tskit that referenced this pull request Nov 22, 2022
The authors of this module and associated tests have agreed to release the code originally developed for the msprime package under the MIT licence (see tskit-dev#2636). Specifically:

@jeromekelleher: "I approve the relicensing of this code originally developed in msprime from GPLv3 to MIT."

@hyanwong: "I am happy to relicense all of my contributions to this code (originally developed in msprime) from GPLv3 to MIT."

@awohns: "I also approve the relicensing of this code originally developed in msprime from GPLv3 to MIT."

@grahamgower: "I approve the relicensing of this code originally developed in msprime from GPLv3 to MIT."

@petrelharp: "I approve the license change!"
@hyanwong hyanwong force-pushed the intervals branch 4 times, most recently from d0dd629 to d83900d Compare November 22, 2022 10:27
@hyanwong
Copy link
Member Author

Can you update the commit message that copies in the intervals.py code to include the agreements from each of the contributors please @hyanwong? Ideally we'd have a commit that copies it in directly with no changes, so that the updates are obvious later if/when someone does a code review for license violations.

Done - commit f246afe is the port, with associated approvals. The other commits are just to get this working in tskit. I will PR the other changes separately.

The authors of this module and associated tests have agreed to release the code originally developed for the msprime package under the MIT licence (see tskit-dev#2636). Specifically:

@jeromekelleher: "I approve the relicensing of this code originally developed in msprime from GPLv3 to MIT."

@hyanwong: "I am happy to relicense all of my contributions to this code (originally developed in msprime) from GPLv3 to MIT."

@awohns: "I also approve the relicensing of this code originally developed in msprime from GPLv3 to MIT."

@grahamgower: "I approve the relicensing of this code originally developed in msprime from GPLv3 to MIT."

@petrelharp: "I approve the license change!"
Rework a little to use the built-in tskit unicode and html output functions
@@ -0,0 +1,600 @@
#
# MIT License
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep the original copyright notices I'm afraid.

I think it might be easier to start this with a new branch:

  • first copy in the msprime code as-is, with no changes. Make a commit, with the message including text from contributors like we have
  • Next update the utils code as required, and commit
  • Then make the changes to intervals.py and tests needed to bring them into tskit, and commit. This includes new copyright headers in addition to the existing ones.

Sorry I know this is tedious, but licencing issues always are and we really don't want there to be any grey areas with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, with a bit of cherry picking from these commits and rebasing you should be able to get to this setup reasonably easily. I can do it if you like.

Copy link
Member Author

@hyanwong hyanwong Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about avoiding grey areas. I think we need to copy the tests as well as the intervals.py file. I was hoping to do this so that the intermediate commit still passed tests, but I guess you are saying we don't need to worry about that.

How do we make it clear in the phrasing that the code is under MIT and GPL? Also, if all contributors agree to the change, surely we can re-release just under MIT (although ISWYM about the copyright rather than the licence)? But I guess you've looked into this a lot and there's some reason we have to jointly keep GPL too, @jeromekelleher

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, if you are happy to do it, and it's easy, go ahead @jeromekelleher . That way it'll be sure to be done as you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave it with me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks

@hyanwong hyanwong mentioned this pull request Jan 5, 2023
@benjeffery benjeffery added this to the Python 0.5.4 milestone Jan 5, 2023
jeromekelleher added a commit to jeromekelleher/tskit that referenced this pull request Jan 10, 2023
The authors of this module and associated tests have agreed to release
the code originally developed for the msprime package under the MIT
licence (see tskit-dev#2636). Specifically:

@jeromekelleher: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@hyanwong: "I am happy to relicense all of my contributions to this code
(originally developed in msprime) from GPLv3 to MIT."

@awohns: "I also approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@grahamgower: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@petrelharp: "I approve the license change!"
@jeromekelleher
Copy link
Member

Superseded by #2678

benjeffery pushed a commit to jeromekelleher/tskit that referenced this pull request Jan 13, 2023
The authors of this module and associated tests have agreed to release
the code originally developed for the msprime package under the MIT
licence (see tskit-dev#2636). Specifically:

@jeromekelleher: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@hyanwong: "I am happy to relicense all of my contributions to this code
(originally developed in msprime) from GPLv3 to MIT."

@awohns: "I also approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@grahamgower: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@petrelharp: "I approve the license change!"
jeromekelleher added a commit to jeromekelleher/tskit that referenced this pull request Jan 13, 2023
The authors of this module and associated tests have agreed to release
the code originally developed for the msprime package under the MIT
licence (see tskit-dev#2636). Specifically:

@jeromekelleher: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@hyanwong: "I am happy to relicense all of my contributions to this code
(originally developed in msprime) from GPLv3 to MIT."

@awohns: "I also approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@grahamgower: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@petrelharp: "I approve the license change!"
mergify bot pushed a commit that referenced this pull request Jan 13, 2023
The authors of this module and associated tests have agreed to release
the code originally developed for the msprime package under the MIT
licence (see #2636). Specifically:

@jeromekelleher: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@hyanwong: "I am happy to relicense all of my contributions to this code
(originally developed in msprime) from GPLv3 to MIT."

@awohns: "I also approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@grahamgower: "I approve the relicensing of this code originally
developed in msprime from GPLv3 to MIT."

@petrelharp: "I approve the license change!"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants