Skip to content
This repository was archived by the owner on Sep 11, 2020. It is now read-only.

Tree iteration performance #636

Open
tajtiattila opened this issue Nov 4, 2017 · 4 comments
Open

Tree iteration performance #636

tajtiattila opened this issue Nov 4, 2017 · 4 comments

Comments

@tajtiattila
Copy link

I'm trying to iterate a tree and list filenames and their hashes within a repo, using https://gist.github.com/tajtiattila/f7357e2536c5ae5b411607ebfa71d5e9

However the performance seems to be at least two orders of magnitude slower than that of Git. Is there a better way to do it?

My intent was to use go-git to check changed files in a custom pre-commit hook for git.

OS is Windows 10. Repo is on an SSD and has large with binary files in it.

$ du -h -s .git
4.9G    .git
$ time git ls-tree --full-tree -r e2f631e842beb18cdc81cd1551ca38367a62dd76 >/dev/null

real    0m0.163s
user    0m0.000s
sys     0m0.000s
$ time commititer e2f631e842beb18cdc81cd1551ca38367a62dd76 >/dev/null

real    0m29.752s
user    0m0.000s
sys     0m0.016s
$
@antham
Copy link
Contributor

antham commented Dec 5, 2017

After running pprof, it seems we waste time (not only there) in ReadByte in teeReader struct from plumbing/format/packfile/scanner.go. Just as experiment, I switched from writing in Hash to write instead in a buffer, the following is what I get :

Before :
before.pdf

real    0m11,769s
user    0m12,664s
sys     0m0,602s

After :
after.pdf

real    0m8,929s
user    0m10,273s
sys     0m0,589s

Results depends on the computer load but I get something between 2 or 3 seconds faster, I did it with 10942 files. Another point where the program seems to stuck is when we have file a bit heavy, I mean something around the megabyte, it takes a bit of time to process.

We I run git I get a result in less than a second so I guess there are many path to improve that.

@robfig
Copy link

robfig commented Dec 19, 2017

I wanted to report unusably slow performance on calculation of object.Commit.Patch(). On my company's git repo, it takes > 10s to list 10 commits and their number of files affected, while it takes 0.04s for the git command line to do the same.

This is probably the same root cause. I took a profile but nothing stood out to me as obviously the thing to do to fix the situation. Here's the SVG CPU profile in case it's helpful
https://gist.github.com/robfig/6ec8ae976bd5ade7a7fcfbe1d9089522#file-gitpatchprofile-svg

This is the code I'm using to get the list of files in a commit. I saw the code in #562 and will try using that to get this information instead.

// commitFilenames returns the filenames contained by the given commit.
func commitFilenames(commit, parent *object.Commit) []string {
	patch, err := commit.Patch(parent)
	glog.FatalIf(err)

	var filenames []string
	for _, filePatch := range patch.FilePatches() {
		_, to := filePatch.Files()
		if to != nil {
			filenames = append(filenames, to.Path())
		}
		// Ignore file deletions
	}
	return filenames
}

Thanks for all your work on this great project!

@smacker
Copy link

smacker commented Feb 14, 2018

@tajtiattila if you need only filenames and hashes you better use TreeWalker:

tree, err := commit.Tree()
if err != nil {
// handle error
}
seen := make(map[plumbing.Hash]bool)
iter := object.NewTreeWalker(tree, true, seen)
for err == nil {
    name, entry, err = iter.Next()
    fmt.Println(name, entry.Hash)
}
if err == io.EOF {
    err = nil
}

it will work much much faster because .Files() method returns FileIter which reads blobs of every file.

Still slower than git though:

$ time git ls-tree --full-tree -r 77866167330a665e174ae08a2f8902ef9dc3438b > /dev/null

real	0m0.154s
user	0m0.034s
sys	0m0.030s

$ time ./listfiles 77866167330a665e174ae08a2f8902ef9dc3438b > /dev/null

real	0m0.690s
user	0m0.797s
sys	0m0.131s

@tajtiattila
Copy link
Author

@smacker Thank you. The docs

func (c *Commit) Files() (*FileIter, error)
Files returns a FileIter allowing to iterate over the Tree

suggest FileIter is just another way to get the tree, so I haven't even considered TreeIter.

I did not know FileIter is inherently slower than TreeIter. Does it have to be?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants