
Prototype for source repository support in cabal.project #4289


Closed
wants to merge 3 commits

Conversation

alexbiehl
Member

This is an attempt at #2189.

Please, let's have a discussion about this. I would really like to have this feature!

This allows you to write the following in your cabal.project file:

```
source-repository-package
  type: git
  location: https://github.com/....
  subdir: somesubdirinyourrepository
  tag: 481cabcab15501f300728bfa4f6de517d4492777
```

And it gets picked up as a dependency of your project.

@mention-bot

@alexbiehl, thanks for your PR! By analyzing the history of the files in this pull request, we identified @dcoutts, @ezyang and @Ericson2314 to be potential reviewers.

@ezyang
Contributor

ezyang commented Feb 2, 2017

The patch looks pleasingly short, but I don't know enough about readSourcePackage to assess if the implementation is correct. I'm inclined to accept this patch.

However, we're going to need tests and documentation. It would also probably be wise to refactor the brancher functions out from Distribution.Client.Get into their own module.

@phadej
Collaborator

phadej commented Feb 2, 2017

One question: can I have the same repository specified twice (with different subdirs and/or different commits)?

@alexbiehl
Member Author

alexbiehl commented Feb 2, 2017

@ezyang I have put the brancher in its own module. One thing that bothers me is that I think I do too much in readSourcePackage: forking the repository and searching for a cabal file. Ideally I would fork the repositories (and later maybe download any remote tarballs) before findProjectPackages, which then does all the heavy lifting with finding cabal files and such, as it does now.

Another point is the calculation of the directory. Currently I take the repoTag as the directory name, which I think is not ideal: if any parameter other than the tag changes, we are not going to fork again! I would want a function hashSourceRepo :: SourceRepo -> HashValue.

And one more: I think we should allow a SourceRepo to take a list of sub directories. This way we can import multiple projects from a source repository with just one stanza in cabal.project. (When creating the hash for a SourceRepo we should exclude the repoSubdir field, so we can add new sub directories without forking again.)
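
To illustrate the multi-subdir idea, a stanza could then look something like this (illustrative only, reusing the example from the description; the exact syntax for multiple sub directories was not settled here, and anothersubdirectory is a made-up name):

```
source-repository-package
  type: git
  location: https://github.com/....
  tag: 481cabcab15501f300728bfa4f6de517d4492777
  subdir: somesubdirinyourrepository
          anothersubdirectory
```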

@alexbiehl
Member Author

alexbiehl commented Feb 2, 2017

@phadej I guess so. Also see my last answer above for open questions.

@phadej
Collaborator

phadej commented Feb 2, 2017

@alexbiehl yeah, that solves the different-subdirectories case, but not different commits. Maybe there should be a comment somewhere about where the repositories are cloned: a hash-based directory would make sense. I wouldn't worry about cleaning them up and/or needing to re-clone on e.g. a tag change (that's a very rare operation and acceptable, but I would mention it in the docs).

@ezyang
Contributor

ezyang commented Feb 3, 2017

> One thing that bothers me is that I think I do too much in readSourcePackage: forking the repository and searching for a cabal file. Ideally I would fork the repositories (and later maybe download any remote tarballs) before findProjectPackages, which then does all the heavy lifting with finding cabal files and such, as it does now.

Yes, this is interesting. I did some more code reading and here is how I think this works:

  • readSourcePackage is called from phaseReadLocalPackages to get some basic information about the local dependency (as recorded by SourcePackage). This includes the package description and the package ID, but NOT necessarily the source code: this is why it's an UnresolvedSourcePackage. Instead, readSourcePackage can say "this package is available from a remote tarball (RemoteTarballPackage) or a RepoTarballPackage", and then there's some code later (FetchUtils, I think) which actually goes ahead and downloads it.

  • This results in some awkward tension with source-repository-package, since the most convenient mechanism for getting the Cabal package description is... to also get the rest of the code. This is why your suggestion is the REVERSE of what the code is nudging you towards: you want to manage downloading the repos before reading out the Cabal description files! Furthermore, PackageLocation doesn't even have a case for requesting a download of an SCM package, although there is a TODO block in the data structure for this case.

If we go by how the code is structured, we definitely should NOT be fetching the code from the repo at readSourcePackage; rather, we should return an unresolved source package with a pointer to the real place, and then let the fetcher code later handle it. But then, how do we get our hands on the actual Cabal file? Perhaps we can implement a "single file" fetcher for these source repositories, where we grab just the Cabal file, with the intention of downloading the full repository later. But this will probably break offline mode if we actually ping the remote repo, even when we already have a cached local copy hanging around. So, I don't think the code is structured correctly, and some refactoring is necessary: perhaps, as you suggest, we do need some pre-download step to get the branches where they need to be. Maybe @dcoutts has some ideas.
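
For orientation, the type in question looked roughly like this at the time (a simplified sketch of PackageLocation from cabal-install's Distribution.Client.Types; details may differ):

```
-- Simplified sketch. 'local' is instantiated to Maybe FilePath for
-- unresolved packages: Nothing means "not downloaded yet", which is what
-- makes an UnresolvedSourcePackage unresolved.
data PackageLocation local =
    LocalUnpackedPackage FilePath              -- a local directory
  | LocalTarballPackage  FilePath              -- a local tarball
  | RemoteTarballPackage URI local             -- fetched later (FetchUtils)
  | RepoTarballPackage   Repo PackageId local  -- tarball from a package repo
  -- no constructor for source repositories yet, only a TODO in the source

type UnresolvedPkgLoc = PackageLocation (Maybe FilePath)
```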

BY THE WAY, why don't remote tarballs from package repositories have this problem? Because there we are expected to have an index which actually contains all of the Cabal files (that is how we discover the tarball in the first place), so we already have the package description without any extra downloading.

@ezyang
Contributor

ezyang commented Feb 3, 2017

> Another point is the calculation of the directory

I see you pushed a commit to hash it. I think this is reasonable as long as we don't expect users to ever want to interact with these checked-out repos. Stack supports Git repos, no? I wonder what they do.

> I think we should allow a SourceRepo to take a list of sub directories. This way we can import multiple projects from a source repository with just one stanza in cabal.project. (When creating the hash for a SourceRepo we should exclude the repoSubdir field, so we can add new sub directories without forking again.)

Yes, I agree. This shouldn't be a difficult fix!

@alexbiehl
Member Author

alexbiehl commented Feb 3, 2017

Edit: sorry, I misunderstood; I basically repeated what you said.

@ezyang It's even worse in the case of ProjectPackageLocalTarball, ProjectPackageRemoteTarball and ProjectPackageRemoteRepo: there isn't even a package id! Just a FilePath, a URI or a SourceRepo. So we have to download and unpack these before we call findProjectPackages to find the cabal files we need.
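
Concretely, those constructors look roughly like this (a simplified sketch of ProjectPackageLocation from Distribution.Client.ProjectConfig; exact fields may differ):

```
-- Rough sketch: for the last three cases there is no package id available
-- up front, only a location to download from.
data ProjectPackageLocation =
    ProjectPackageLocalCabalFile FilePath
  | ProjectPackageLocalDirectory FilePath FilePath  -- dir and its .cabal file
  | ProjectPackageLocalTarball   FilePath
  | ProjectPackageRemoteTarball  URI
  | ProjectPackageRemoteRepo     SourceRepo
```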

So my current understanding is:

  • Fetching just one file out of an SCM repository or from a tarball seems awfully complicated.
  • Forking and unpacking for these cases needs to be done before we look for project packages.

And: the last time I checked, stack clones the repository but lets users tamper with the cloned source without any restrictions.

@ezyang
Contributor

ezyang commented Feb 3, 2017

Yeah. So I think we've agreed: the existing code thought this functionality should be implemented in a particular way, but that strategy doesn't actually work out of the box...

...but now I've just thought of a dastardly idea! So, let's think about the remote Hackage repository case: how did we get the package description here? It's because we downloaded an external index using "cabal update". So, what if we introduced a similar command for fetching package descriptions from remote tarballs / source repos / etc.? The external motivation for this command is: "Download all the package source info necessary, so that I can do the rest of my development offline." I don't really know where you'd store this info (maybe it'll end up looking like the repo hashes you are doing here), and you'd probably want new-build to automatically invoke this if it finds that it can't find the revision it needs, but it's one interesting idea about how to structure the logic here.

@alexbiehl
Member Author

alexbiehl commented Feb 4, 2017

@ezyang sounds like a nice symmetry! What would such a command's name be?

Edit: I would like to give it a shot. What you suggest would be something like "fork-my-repositories-and-download-my-remote-tarballs-so-I-can-use-new-build-on-my-project", right? Sounds like "new-fetch".

@Ericson2314
Collaborator

Also, the freeze file should track the mapping of git refs (branches, tags) to revs (commit hashes) and tree hashes (since .git should be thrown away). [Btw, I am stealing this idea from Rust's Cargo.]

Do we already track the Hackage index revision similarly? Granted, that is far less important as Hackage only grows bigger.
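
For illustration, one could picture such a freeze entry for a source-repository-package as something like this (a purely hypothetical sketch; nothing of the sort exists in this patch):

```
-- Hypothetical freeze-file entry: the user-facing ref plus the resolved,
-- immutable identifiers it pointed to at freeze time.
data FrozenSourceRepo = FrozenSourceRepo
  { frozenLocation :: String  -- repository URL
  , frozenRef      :: String  -- branch or tag as written by the user
  , frozenCommit   :: String  -- resolved commit hash
  , frozenTreeHash :: String  -- tree hash, since .git is to be thrown away
  }
```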

@ezyang
Contributor

ezyang commented Feb 6, 2017

@alexbiehl I like fetch; and we already have a cabal fetch command which is intended to download sources for later installation, so this would just be added functionality.

@dcoutts
Contributor

dcoutts commented Feb 21, 2017

Indeed, the existing design allows for remote tarballs, and I think remote SCMs can fit that same pattern here. It does mean we have to download the thing first even to extract the .cabal file. Only local dirs and hackage archives can short-cut that part by being able to read the .cabal file cheaply.

I'm not convinced this needs splitting out into a separate fetch step/command, but I'm prepared to be convinced. As far as I can see, if we follow the remote tarball pattern then it should be the same.

And @Ericson2314 makes a good point about freezing.

Contributor

@dcoutts dcoutts left a comment


Ok, I think this is going in the right direction, but there are a few things we need to think about.

As I said in a comment already, I think it's right that we do have to fetch the repo up front just to be able to read the .cabal file. I don't see any sane alternative, and it's the approach I was imagining for both source repos and remote tarballs.

But we do need to handle the case of the user editing the source repo input info better.

Currently the strategy seems to be to hash the source repo input info and do a fresh clone; if the hash has not changed and the local checkout exists, then it is assumed to be immutable.

But that means that if the user edits the tag/checkout or other info then we'll fetch a complete new copy of the repo. That does have a certain simplicity to it, but I don't think it's sustainable; it's too expensive. So I think we need to be able to handle the incremental case: there's already some checkout in the expected location and we need to synchronise that with the current info for that repo (e.g. changed checkout, changed subdirs, even a changed URL).

Clearly this is more complicated and will need more features in the branchers. My initial suggestion is that we think of it as a synchronise operation: we pass the brancher impl the location of the current checkout and the info about what we want it to be (URL, tag etc) and ask it to cheaply check if it's already there and if not to synchronise it to the target info.

We also need to extend the branchers to be able to give us content hashes, as we currently do for hackage packages. We want to be able to treat source repo packages as non-local dependencies, including installing them into the store (and thus sharing them between runs in CI systems or between multiple projects). This means we need to be able to compute nix style hashes. For git that should be easy: the current checkout hash is perfect. But this obviously is brancher specific, so we'd need to extend the brancher API for that.
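
A minimal sketch of what such an extended brancher API could look like (the names here are made up for illustration, not part of the patch):

```
-- Hypothetical extension of the brancher interface: an idempotent sync
-- operation plus a stable content hash per checkout.
data Brancher = Brancher
  { -- Cheaply check whether the checkout at the given path already matches
    -- the desired SourceRepo (URL, tag, ...) and, if not, synchronise it.
    brancherSync       :: FilePath -> SourceRepo -> IO ()
    -- A stable, VCS-specific hash of the checked-out sources, e.g. the
    -- commit hash for git, usable as a nix-style store key.
  , brancherSourceHash :: FilePath -> IO HashValue
  }
```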

```
-- hashing.
hashSourceRepo :: SourceRepo -> HashValue
hashSourceRepo sourceRepo =
    hashValue $ Binary.encode (sourceRepo { repoSubdir = Nothing })
```
Contributor


So if I understand correctly, is this just to pick the location of the repository, or is this to calculate the source hash for the content of the repo? In the latter case I don't think this belongs in this module, which is about the package id hashes, which are based on content hashes.

```
@@ -0,0 +1,155 @@
module Distribution.Client.Get.Brancher where
```
Contributor


I think we prefer explicit export lists. People introduce private utils during refactoring and shouldn't have to think about introducing an export list at the same time.


```
readSourcePackage verbosity distDirLayout (ProjectPackageRemoteRepo repo) = do
    let
      scmSrcDir = "scm-" ++ showHashValue (hashSourceRepo repo)
```
Contributor


I'm not convinced that this is a good idea. I think we ought to be able to pick generally non-clashing names here without having to use huge hashes.

So what's the issue? People said they wanted to be able to have the same repo but with two different checkouts (and presumably two different subdirs, otherwise they'd be selecting two versions of the same package, which isn't supported with local dirs either).

So I can see that picking non-clashing names is an issue, but on the other hand we don't want to have to do a fresh checkout when someone updates the project file to point to a different tag. In those cases we should simply be able to fetch & reset. Note that this may imply more functionality in the branchers than we have right now, since currently we can only do a fresh checkout, I think, right?
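
For git, such a fetch & reset could be as simple as the following (a hypothetical helper using System.Process; error handling and other VCSs omitted):

```
import System.Process (callProcess)

-- Reuse an existing clone: point it at the (possibly changed) URL, fetch,
-- and hard-reset the working tree to the requested tag or commit.
syncGitRepo :: FilePath -> String -> String -> IO ()
syncGitRepo dir url tag = do
  callProcess "git" ["-C", dir, "remote", "set-url", "origin", url]
  callProcess "git" ["-C", dir, "fetch", "origin", "--tags"]
  callProcess "git" ["-C", dir, "reset", "--hard", tag]
```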

```
      -- we only need this for error messages
      repoloc = fromMaybe "" (repoLocation repo)

    repoExists <- liftIO $ doesDirectoryExist destDir
```
Contributor


Ok, here is where we will need to think about the incremental case. Notice that we are in a Rebuild monad. Currently, this readSourcePackage will get re-run every time the cabal.project file changes. That could be adjusted so we only re-run when the input information for the source repo changes.

BTW, notice that for the local file case we do a monitorFiles [monitorFileHashed cabalFile] to force this code to get re-run when the user edits that file. For source repos I think we don't need to worry about the user editing things under us.

```
  -> readSourcePackage
       verbosity
       distDirLayout
       (ProjectPackageLocalDirectory pkgdir cabalFile)
```
Contributor


I don't think we want to convert this into a project-local package; we want to preserve the information that this package is from a source repo and not necessarily treat it as local. That is, we should be able to treat it like a package from Hackage.

If we do this then it means we can put packages built from repos into the store and share them between multiple projects. This would be a big win. But it also means we need a mechanism to obtain a hash of the sources in the repo. This does not have to be a content hash of the unpacked dir, any stable content hash will do. So for git for example, the current git checkout hash is perfect. For other repo types other methods may be needed.
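
For the git case, obtaining that hash could be a one-liner (a hypothetical helper; other repo types would need their own mechanism):

```
import System.Process (readProcess)

-- The commit hash of the current checkout is a stable content hash for git.
gitSourceHash :: FilePath -> IO String
gitSourceHash dir =
  takeWhile (/= '\n') <$> readProcess "git" ["-C", dir, "rev-parse", "HEAD"] ""
```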

@dcoutts
Contributor

dcoutts commented Feb 22, 2017

One thing that might help here is if one of us implements the local or remote tarball case, since that covers the general pattern of when we download, how to deal with changes in the remote tarball URL (the only info for a tarball; obviously source repos have more), and also the issue of getting source hashes so we can treat these tarball packages as non-local deps.

@Ericson2314
Collaborator

This isn't great for git's garbage collection, but it's perfectly possible to just fetch everything into the same local repo, and let git discover all possible incremental downloading that way.

@dcoutts
Contributor

dcoutts commented Mar 15, 2017

Related: the branch that #4399 is part of.

@erikd
Member

erikd commented May 2, 2017

Only just noticed this PR. I've been using a thin wrapper around cabal called mafia for about a year now, and I think the way it handles code not on Hackage is somewhat better than what is proposed here. First and foremost, adding a project does not require any changes to the host project's cabal file.

@bgamari
Contributor

bgamari commented Sep 27, 2017

What is the status of this? I would also really like to have this feature.

@23Skidoo
Member

Looks stalled.

@thoughtpolice
Member

This would also be extremely useful for Clash, since I recently added .project files for it -- we have a few GHC plugins that currently need to be cloned manually where they could otherwise be handled by this.

I mentioned this to @dcoutts recently and IIRC he suggested that he'd like part of the work for extra-packages to land as well for this feature (currently in wip/extra-packages). @dcoutts -- thoughts on what needs to happen next? Now that the Summer of Code is over we have less bandwidth...

@fgaz
Member

fgaz commented Oct 31, 2017

extra-packages is in

@gbaz
Collaborator

gbaz commented Mar 24, 2018

The provenance-qualified imports proposal (ghc-proposals/ghc-proposals#115) is relevant. Note that even if that passes, nearly all the work on this branch remains relevant as the core trickiness is in the fetching logic.
