This repository was archived by the owner on Sep 11, 2020. It is now read-only.

packfile: cache undeltified objects to improve decode performance #218

Merged
merged 1 commit into from
Jan 25, 2017

Conversation

ajnavarro
Contributor

@ajnavarro ajnavarro commented Jan 19, 2017

Simple object cache that keeps in memory the last undeltified objects. When no more objects can be kept in memory, the oldest one is deleted. A rough sketch of the idea follows the benchmarks below.

Benchmarks:

  • master branch:
clone: ~4m2s
count: ~3m40s count: 207956
  • this PR:
clone: ~19s
count: ~14.2s count: 207956
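
Roughly, the idea looks like the following sketch (illustrative Go with hypothetical names, not the exact code in this PR): objects are indexed by hash in a map, and a slice records insertion order so the oldest entry can be evicted first.

// Illustrative FIFO object cache; names are hypothetical and do not
// necessarily match this PR's implementation.
package cache

import "gopkg.in/src-d/go-git.v4/plumbing"

type fifo struct {
	maxElements int
	objects     map[plumbing.Hash]plumbing.EncodedObject
	order       []plumbing.Hash
}

func newFIFO(max int) *fifo {
	if max <= 0 {
		max = 1
	}

	return &fifo{
		maxElements: max,
		objects:     make(map[plumbing.Hash]plumbing.EncodedObject),
	}
}

// Add stores an object, evicting the oldest entry when the cache is full.
func (c *fifo) Add(o plumbing.EncodedObject) {
	if len(c.order) >= c.maxElements {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.objects, oldest)
	}

	c.order = append(c.order, o.Hash())
	c.objects[o.Hash()] = o
}

// Get returns a cached object and reports whether it was present.
func (c *fifo) Get(h plumbing.Hash) (plumbing.EncodedObject, bool) {
	o, ok := c.objects[h]
	return o, ok
}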

@ajnavarro
Contributor Author

Feel free to add any comments or suggest ways to improve this.

@codecov-io

codecov-io commented Jan 19, 2017

Current coverage is 76.36% (diff: 80.88%)

Merging #218 into master will decrease coverage by 0.55%

@@             master       #218   diff @@
==========================================
  Files            96         98     +2   
  Lines          6299       6359    +60   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           4845       4856    +11   
- Misses          922        978    +56   
+ Partials        532        525     -7   

Powered by Codecov. Last update 85a1642...b5fad06

Collaborator

@smola smola left a comment


Looks great!

  1. We should discuss whether the cache should be a package of its own.
  2. It would be nice if you could measure the speed-up of cloning the git/git repository with this change.

	if max <= 0 {
		max = 1
	}
	c.maxElements = max
Collaborator


extra space after if


func (c *cache) Add(o plumbing.EncodedObject) {
	if len(c.order) >= c.maxElements {
		d, order := c.order[0], c.order[1:]
Collaborator


This is quite inefficient, we should probably implement a queue as a circular list backed by a slice.

Contributor


definitely.
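
For illustration, a slice-backed circular queue along the lines of that suggestion could look like this (hypothetical sketch, not the merged code): head and tail indices wrap around, so eviction never reslices or shifts elements.

// Hypothetical ring-buffer queue of hashes backed by a fixed-size slice;
// not the implementation that was merged.
package cache

import "gopkg.in/src-d/go-git.v4/plumbing"

type hashQueue struct {
	buf  []plumbing.Hash
	head int // index of the oldest element
	size int // number of elements currently stored
}

func newHashQueue(capacity int) *hashQueue {
	return &hashQueue{buf: make([]plumbing.Hash, capacity)}
}

func (q *hashQueue) Full() bool { return q.size == len(q.buf) }

// Push appends a hash at the tail; callers are expected to Pop first when Full.
func (q *hashQueue) Push(h plumbing.Hash) {
	tail := (q.head + q.size) % len(q.buf)
	q.buf[tail] = h
	q.size++
}

// Pop removes and returns the oldest hash; callers must ensure the queue is
// not empty. Nothing is shifted or reallocated.
func (q *hashQueue) Pop() plumbing.Hash {
	h := q.buf[q.head]
	q.head = (q.head + 1) % len(q.buf)
	q.size--
	return h
}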

@@ -62,6 +67,8 @@ type Decoder struct {
 
 	offsetToType map[int64]plumbing.ObjectType
 	decoderType  plumbing.ObjectType
+
+	cache *cache
Collaborator


Maybe we could make cache public in its own package, since an object cache might be useful in other cases...

@@ -63,8 +63,7 @@ func (s *BaseSuite) NewRepositoryFromPackfile(f *fixtures.Fixture) *Repository {
 	p := f.Packfile()
 	defer p.Close()
 
-	n := packfile.NewScanner(p)
-	d, err := packfile.NewDecoder(n, r.s)
+	d, err := packfile.NewDecoder(p, r.s)
Collaborator


Why does this PR change the way of instantiating Scanner/Decoder?

@@ -128,9 +138,20 @@ func (d *Decoder) Decode() (checksum plumbing.Hash, err error) {
 	return d.s.Checksum()
 }
 
-func (d *Decoder) doDecode() error {
+func (d *Decoder) Count() (uint32, error) {
+	_, count, err := d.s.Header()
Collaborator


It looks like decoding breaks if Count is called twice?

		return 0, err
	}

	d.cache.SetMaxElements(int(count / magicCacheNumber))
Collaborator


Setting the max elements of the cache inside this method (the public Count) looks weird.

@@ -128,9 +138,20 @@ func (d *Decoder) Decode() (checksum plumbing.Hash, err error) {
 	return d.s.Checksum()
 }
 
-func (d *Decoder) doDecode() error {
+func (d *Decoder) Count() (uint32, error) {
Contributor


Can you add documentation? I don't know the goal of this function.

"gopkg.in/src-d/go-git.v4/plumbing"
)

type cache struct {
Contributor


Can you add a description of how this cache operates?

@mcuadros
Contributor

The improvement is great, but why? It would be great to know the average number of re-used objects in a repository; maybe we can use a more sophisticated algorithm. How is this done in jgit and libgit?

Contributor

@alcortesm alcortesm left a comment


This is an awesome contribution.


func (c *cache) Add(o plumbing.EncodedObject) {
	if len(c.order) >= c.maxElements {
		d, order := c.order[0], c.order[1:]
Contributor


definitely.

@ajnavarro
Contributor Author

I rewrote the implementation from scratch, taking into account all your comments. Basically, it is the same implementation as JGit's.

@@ -105,6 +108,8 @@ func NewDecoderForType(s *Scanner, o storer.EncodedObjectStorer,
 
 		offsetToType: make(map[int64]plumbing.ObjectType, 0),
 		decoderType:  t,
+
+		cache: cache.NewObjectFIFO(cache.MaxSize),
Contributor Author


The cache.MaxSize should be exposed as external repository configuration. Any ideas on how to do it correctly?
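
One possible way to expose it, purely as a sketch (not something this PR or go-git necessarily adopted), is a functional option on the decoder constructor. The cache field, NewDecoder, and NewObjectFIFO names are taken from this PR; NewObjectFIFO is assumed to accept the desired element count, and the code is assumed to live inside the packfile package with the cache and storer packages imported.

// Hypothetical sketch inside package packfile; assumes the cache package and
// constructor introduced by this PR are imported.
type Option func(*Decoder)

// WithCacheMaxSize overrides the default cache.MaxSize for this decoder.
func WithCacheMaxSize(n int) Option {
	return func(d *Decoder) {
		d.cache = cache.NewObjectFIFO(n)
	}
}

// NewDecoderWithOptions builds a Decoder and applies the given options.
func NewDecoderWithOptions(s *Scanner, o storer.EncodedObjectStorer, opts ...Option) (*Decoder, error) {
	d, err := NewDecoder(s, o)
	if err != nil {
		return nil, err
	}

	for _, opt := range opts {
		opt(d)
	}

	return d, nil
}

A higher-level knob (for example on the repository or clone configuration) could then feed the value down to wherever the decoder is constructed.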

Simple object cache that keeps in memory the last undeltified objects. When no more objects can be kept in memory, the oldest one is deleted.
@ajnavarro
Contributor Author

Code used to check times:

package main

import (
	"time"
	"os"

	"gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/storage/filesystem"
	
	. "gopkg.in/src-d/go-git.v4/examples"
	osfs "srcd.works/go-billy.v1/os"
)

func main() {
	for i := 0; i < 3; i++ {
		directory := "/tmp/git/"
		os.RemoveAll(directory)
		s, err := filesystem.NewStorage(osfs.New(directory))
		CheckIfError(err)

		r, err := git.NewRepository(s)
		CheckIfError(err)
		Info("cloning")
		now := time.Now()
		err = r.Clone(
			&git.CloneOptions{
				URL: "file:///path/to/git/repository",
			},
		)
		CheckIfError(err)
		spent := time.Since(now)
		Info("clone: %s", spent)

		Info("starting to iterate")
		iter, err := s.ObjectStorage.IterEncodedObjects(plumbing.AnyObject)
		CheckIfError(err)

		count := 0
		now = time.Now()
		err = iter.ForEach(func(o plumbing.EncodedObject) error {
			count++
			return nil
		})
		CheckIfError(err)

		spent = time.Since(now)
		Info("count: %s count: %v", spent, count)
	}
}

@smola
Collaborator

smola commented Jan 25, 2017

We should probably improve this in the future with an LRU cache with maximum memory size, but so far it looks great. LGTM.
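
For reference, a memory-bounded LRU along those lines could look roughly like this (illustrative sketch, not the FIFO cache merged in this PR; it assumes EncodedObject.Size() approximates the in-memory cost of an object):

// Illustrative memory-bounded LRU sketch; not the cache merged here.
package cache

import (
	"container/list"

	"gopkg.in/src-d/go-git.v4/plumbing"
)

type lru struct {
	maxBytes  int64
	usedBytes int64
	ll        *list.List                      // front = most recently used
	index     map[plumbing.Hash]*list.Element // hash -> element holding the object
}

func newLRU(maxBytes int64) *lru {
	return &lru{
		maxBytes: maxBytes,
		ll:       list.New(),
		index:    make(map[plumbing.Hash]*list.Element),
	}
}

// Add inserts an object and evicts least recently used entries until the
// total cached size fits within maxBytes.
func (c *lru) Add(o plumbing.EncodedObject) {
	if e, ok := c.index[o.Hash()]; ok {
		c.ll.MoveToFront(e)
		return
	}

	c.index[o.Hash()] = c.ll.PushFront(o)
	c.usedBytes += o.Size()

	for c.usedBytes > c.maxBytes && c.ll.Len() > 0 {
		e := c.ll.Back()
		old := e.Value.(plumbing.EncodedObject)
		c.ll.Remove(e)
		delete(c.index, old.Hash())
		c.usedBytes -= old.Size()
	}
}

// Get returns a cached object, marking it as recently used.
func (c *lru) Get(h plumbing.Hash) (plumbing.EncodedObject, bool) {
	e, ok := c.index[h]
	if !ok {
		return nil, false
	}

	c.ll.MoveToFront(e)
	return e.Value.(plumbing.EncodedObject), true
}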

@smola smola merged commit 97c0273 into src-d:master Jan 25, 2017
@smola smola mentioned this pull request Jan 26, 2017
mcuadros pushed a commit that referenced this pull request Jan 31, 2017
* Simple object cache that keeps in memory the last undeltified objects.
  When no more objects can be kept in memory, the oldest one is deleted (FIFO).
  This speeds up packfile operations by preventing redundant seeks and decodes.
5 participants