
Make LFS http_client parallel within a batch. #32369

Merged
merged 7 commits into go-gitea:main from lfs-parallel-within-batch
Nov 4, 2024

Conversation


@rremer rremer commented Oct 29, 2024

No description provided.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Oct 29, 2024
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 29, 2024
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Oct 29, 2024

rremer commented Oct 29, 2024

This is somewhat related to #32307; without that, mirroring upstream LFS files is limited to 20 concurrent threads. For my own internal mirroring, setting [lfs_client].BATCH_SIZE=400 added ~100MB of memory usage and shaved 3 days off the initial mirror of our admittedly quite large repo :).
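For context on what "parallel within a batch" means here, a minimal self-contained sketch (the names object, downloadOne, and processBatch are assumed for illustration, not Gitea's actual identifiers) of fanning one batch's objects out to errgroup goroutines:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// object stands in for one entry of an LFS batch API response.
type object struct{ oid string }

// downloadOne stands in for the per-object transfer work.
func downloadOne(ctx context.Context, o object) error {
	fmt.Println("downloading", o.oid)
	return nil
}

// processBatch hands every object of a single batch to its own goroutine and
// waits for the whole batch; the first non-nil error cancels the group context.
func processBatch(ctx context.Context, batch []object) error {
	errGroup, groupCtx := errgroup.WithContext(ctx)
	for _, o := range batch {
		o := o // only needed before Go 1.22; harmless on newer versions
		errGroup.Go(func() error {
			return downloadOne(groupCtx, o)
		})
	}
	// Wait blocks until every goroutine returns and yields the first non-nil error.
	return errGroup.Wait()
}

func main() {
	_ = processBatch(context.Background(), []object{{"oid-a"}, {"oid-b"}, {"oid-c"}})
}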

} else {
err := dc(object.Pointer, nil, object.Error)
if errors.Is(object.Error, ErrObjectNotExist) {
log.Warn("Ignoring missing upstream LFS object %-v: %v", object.Pointer, err)
rremer (Contributor Author):

I wanted to point out that this actually reduces the usefulness of this log line. I moved it out of modules/repository/repo.go, where it had access to which repository it was operating on, so you knew where to investigate if you wanted to track down the LFS pointer/blob. Here in modules/lfs/http_client.go I don't have access to the repository name. I was thinking I could add a value to the Context which is passed down from the repo module to here?

@rremer rremer force-pushed the lfs-parallel-within-batch branch from dbb14d5 to 82754f0 on October 29, 2024 17:18
}
continue
}
func(groupCtx context.Context, object *ObjectResponse, dc DownloadCallback, uc UploadCallback, transferAdapter TransferAdapter) {

rremer (Contributor Author):

Please advise, are you suggesting that I can just do this?:

for _, object := range result.Objects {
  go performSingleOperation(groupCtx, object, dc, uc, transferAdapter)
}

@wxiaoguang (Contributor) commented Oct 31, 2024:

Yup, I think so if it is right.

Maybe just for { errGroup.Go(...) } is enough?

rremer (Contributor Author):

So actually, this closure seems necessary. Admittedly I used this pattern because I'd seen it in practice elsewhere, not because I fully understood why it was in use, but errGroup.Go only takes a function with no arguments; you cannot pass arguments to the function you hand to errGroup.Go. An anonymous closure like this is the only way I see to do it.

@wxiaoguang (Contributor) commented Nov 1, 2024:

Admittedly I used this pattern because I'd seen it in practice elsewhere,

Historically, it was due to https://go.dev/blog/loopvar-preview
(https://github.com/go-gitea/gitea/pull/32369/files#r1821738812)

not because I fully understood why it was in use

Well, that's why I would like to clarify the problem. My philosophy is: "do not write code which you do not understand".

but errGroup.Go only takes a function without arguments; you cannot pass arguments to functions inside errGroup.

I do not see why it is a problem.

(the demo code in https://github.com/go-gitea/gitea/pull/32369/files#r1823705885)

for _, object := range result.Objects {
	errGroup.Go(func() error {
		return performSingleOperation(groupCtx, object, dc, uc, transferAdapter)
	})
}
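(Side note on the loopvar link above: before Go 1.22 every iteration shared a single loop variable, so a closure passed to errGroup.Go had to capture a per-iteration copy explicitly; since Go 1.22 each iteration gets a fresh variable and the snippet above is safe as written. A hedged illustration of the pre-1.22 workaround, reusing the names from the snippet above:)

for _, object := range result.Objects {
	object := object // pre-Go-1.22: give each closure its own copy of the loop variable
	errGroup.Go(func() error {
		return performSingleOperation(groupCtx, object, dc, uc, transferAdapter)
	})
}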

@rremer rremer force-pushed the lfs-parallel-within-batch branch 3 times, most recently from 76ccd7c to 46b730a on October 30, 2024 23:10

rremer commented Oct 30, 2024

Had one more force-push after running make tidy to handle a compliance check (moving x/sync from an indirect to a direct dependency in go.mod). I believe I've addressed all comments except for this one, on which I'd like clarification/direction.

return err
}
// this was NOT an 'upload' request inside the batch request, meaning it must be a 'download' request
err := dc(object.Pointer, nil, object.Error)
@wxiaoguang (Contributor) commented Oct 31, 2024:

Still have the question here (a new thread):

Now the code is:

if uc != nil {
    if err := uc(); err != nil {
        return err
    }
    // did you miss a return here?
}
// It seems that when uc != nil and uc() == nil, then dc is called by mistake?
err := dc()
...

rremer (Contributor Author):

Good question, but no, I believe this is correct. What may be confusing is that the existing code on main assumes the uploadCallback will return an error if it is called with a non-nil object.Error. This whole closure is predicated on that: we are guaranteed to have a non-nil object.Error, so the existing logic (which is perhaps not very defensive) is not flawed; a call to uc or dc will always also return an error. So the flow of this code is:

  1. object.Error is non-nil, meaning this specific oid within the batch has some kind of error; maybe it was an upload request, maybe it was a download request, we don't know yet
  2. if the upload callback is non-nil, this was an upload request: execute the upload callback with the object.Error and get a guaranteed error returned
  3. if there was no upload callback, we must have a download callback; as mentioned in the previous thread, there are currently only two kinds of operations allowed by the LFS batch API. Previously this was in an 'else' block; I removed that on the linter's advice, since the else block would always get executed and removing it outdents this code further.

Put another way, when object.Error != nil this closure will always return either the error from uploadCallback or downloadCallback, OR nil from downloadCallback if the error should be ignored (404). No other value can be returned; it's just a switch for executing the correct upload/download callback. A rough sketch of that dispatch follows below.
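For illustration only, the intended flow as described above, with callback signatures elided the same way as in the thread's other pseudocode (this is the described behavior, not the exact diff under review):

// object.Error is known to be non-nil when this closure runs
if uc != nil {
    // upload request: uc is assumed to always return an error when object.Error is non-nil
    return uc(...)
}
// download request: dc returns nil only when the error is an ignorable 404 (ErrObjectNotExist)
return dc(...)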

@wxiaoguang (Contributor) commented Nov 1, 2024:

2. if the upload callback was non-nil, it means this was an upload request, execute the upload callback with the object.Error and get a guaranteed error returned

That's the problem. Which test or document guarantees that "upload callback returns an error" in this case?

The changed logic is fragile:

  • It does change the original if-flow: if the "upload callback" returns nil, the old code just "continues" to the next object without calling the "download callback", but the new code calls the "download callback".
  • It is counter-intuitive; code readers must think a lot more about how the "upload callback" works to understand whether there is a bug.
  • If the upload callback's behavior changes in the future (for example, ignoring some errors), the new logic will be difficult to maintain.

To be intuitive, defensive programming, and easier to understand, I think we should choose one explicit behavior:

if uc != nil {
    if err := uc(); err != nil {
        return err
    }
    // choose one explicit behavior:
    // * "return nil" means the upload callback could decide to ignore some kinds of errors, not stop the uploading
    // * "return err" means whatever upload callback does, any error should stop the uploading
    //
    // Actually "return nil" is the old behavior, I think it was well-designed, 
    // and it makes it possible to ignore some errors.
    return nil
}

@rremer rremer force-pushed the lfs-parallel-within-batch branch from 46b730a to ef84dfa on October 31, 2024 22:25
@@ -113,6 +115,7 @@ func (c *HTTPClient) Upload(ctx context.Context, objects []Pointer, callback Upl
return c.performOperation(ctx, objects, nil, callback)
}

// performOperation takes a slice of LFS object pointers, batches them, and performs the upload/download operations concurrently in each batch
Contributor:

One more thing, I do not think "batch size" means "concurrency limit" in this case. If I understand correctly:

  • "batch size" means how many objects are processed at one time.
  • "concurrency limit" means how many concurrent connections are made to the LFS server.

Maybe "batch size" could be 100 or 1000 without causing problems, but the "concurrency limit" should be much smaller, e.g. 5 or 10. A lot of connections to an LFS server might trigger its rate-limit protection, or be treated as something like a DoS.
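A hedged sketch of how the two knobs could be decoupled with errgroup's SetLimit (the setting name follows the later discussion in this thread; the exact wiring here is assumed, not the merged code):

errGroup, groupCtx := errgroup.WithContext(ctx)
// BATCH_SIZE still controls how many objects one LFS batch API request asks for;
// SetLimit caps how many of those objects are transferred at the same time.
errGroup.SetLimit(setting.LFSClient.BatchOperationConcurrency)
for _, object := range result.Objects {
	errGroup.Go(func() error {
		return performSingleOperation(groupCtx, object, dc, uc, transferAdapter)
	})
}
if err := errGroup.Wait(); err != nil {
	return err
}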

rremer (Contributor Author):

I'm always in favor of making things more configurable, but as an administrator I think most of the time you would want [lfs_client].BATCH_SIZE to equal the number of concurrent threads for upload/download within a batch. I'm happy to add a new config though; what should we call it and what should its default be? [lfs_client].BATCH_MAX_THREADS?

Contributor:

Maybe [lfs_client].MAX_OPERATION_CONCURRENCY ? (since there is no "thread" concept in Golang)

rremer (Contributor Author):

Fair, how about [lfs_client].BATCH_OPERATION_CONCURRENCY? There may be other single-threaded operations still in this code that we might want configs for later, and this one is specific to LFS client batch operations. I updated this branch with that config and defaulted it to [lfs_client].BATCH_SIZE when unset or < 1.

Contributor:

We need to match git-lfs's default behavior:

https://github.com/git-lfs/git-lfs/blob/4cca3f8e7d77c5b86cdfe34c2e6bba2126e3d5cd/docs/man/git-lfs-config.5.ronn#L25-L33

lfs.concurrenttransfers

The number of concurrent uploads/downloads. Default 3.

@rremer rremer force-pushed the lfs-parallel-within-batch branch 2 times, most recently from 2e5a251 to 13f2a7c on November 1, 2024 20:32
@github-actions github-actions bot added the docs-update-needed The document needs to be updated synchronously label Nov 1, 2024
Signed-off-by: Royce Remer <royceremer@gmail.com>
@rremer rremer force-pushed the lfs-parallel-within-batch branch from 13f2a7c to 4653cad on November 1, 2024 22:31

rremer commented Nov 1, 2024

The last force-push was just a rebase from main; this should be ready to review again. I believe I've addressed all comments, although I suspect there may be opinions on my choice of the new config key under [lfs_client].

@wxiaoguang (Contributor) commented:

Made 2 commits:

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Nov 2, 2024
@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Nov 4, 2024
@wxiaoguang wxiaoguang enabled auto-merge (squash) November 4, 2024 04:20
@wxiaoguang wxiaoguang merged commit 54146e6 into go-gitea:main Nov 4, 2024
26 checks passed
@GiteaBot GiteaBot added this to the 1.24.0 milestone Nov 4, 2024
@@ -66,6 +67,11 @@ func loadLFSFrom(rootCfg ConfigProvider) error {
LFSClient.BatchSize = 20
}

if LFSClient.BatchOperationConcurrency < 1 {
// match the default git-lfs's `lfs.concurrenttransfers`
LFSClient.BatchOperationConcurrency = 3
rremer (Contributor Author):

The default in the git-lfs client is actually 8, not 3: https://github.com/git-lfs/git-lfs/blob/main/lfshttp/client.go#L89 , but I've documented what you set here: https://gitea.com/gitea/docs/pulls/88

@wxiaoguang (Contributor) commented Nov 5, 2024:

Good catch ... actually there are different documents (I read an outdated one), and the default value has also changed over time. 😅

The new document: https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-config.adoc also says 8

@wxiaoguang (Contributor) commented Nov 5, 2024:

I will update it to 8 and update the documents.

-> Use 8 as default value for git lfs concurrency #32421

Thank you very much.

@lunny lunny modified the milestones: 1.24.0, 1.23.0 Nov 5, 2024
zjjhot added a commit to zjjhot/gitea that referenced this pull request Nov 6, 2024
* giteaofficial/main: (21 commits)
  Use 8 as default value for git lfs concurrency (go-gitea#32421)
  Fix milestone deadline and date related problems (go-gitea#32339)
  Only query team tables if repository is under org when getting assignees (go-gitea#32414)
  Refactor RepoRefByType (go-gitea#32413)
  Refactor template ctx and render utils (go-gitea#32422)
  Refactor DateUtils and merge TimeSince (go-gitea#32409)
  Refactor markup package (go-gitea#32399)
  Add some handy markdown editor features (go-gitea#32400)
  Make LFS http_client parallel within a batch. (go-gitea#32369)
  Refactor repo legacy (go-gitea#32404)
  Replace DateTime with proper functions (go-gitea#32402)
  Fix git error handling (go-gitea#32401)
  Fix created_unix for mirroring (go-gitea#32342)
  Replace DateTime with DateUtils (go-gitea#32383)
  improve performance of diffs (go-gitea#32393)
  Refactor tests to prevent from unnecessary preparations (go-gitea#32398)
  Add artifacts test fixture (go-gitea#30300)
  Fix `missing signature key` error when pulling Docker images with `SERVE_DIRECT` enabled (go-gitea#32365)
  Fix a number of typescript issues (go-gitea#32308)
  Update go dependencies (go-gitea#32389)
  ...
@go-gitea go-gitea locked as resolved and limited conversation to collaborators Feb 2, 2025