Feature Request: Add --worker Flag for Path-Hash-Based Partitioned Transfers in rclone #8400

Open
zackees opened this issue Feb 16, 2025 · 1 comment

Comments

Contributor

zackees commented Feb 16, 2025

Feature Request: Add --worker Flag for Partitioned, Hash-Based File Transfers

I'm going to implement this myself in my Python API to increase throughput, but I thought I'd write a feature request for completeness. Feel free to close this request if it's not applicable. I may be able to implement this feature in rclone myself if it's something you are interested in.

Overview

I'd like to request a new feature that allows rclone to transfer only a portion of a server's content. This feature would enable users to run multiple rclone instances concurrently, with each instance responsible for a distinct subset of files. The goal is to facilitate distributed transfers and avoid duplicate work when syncing or copying large datasets.

Proposed Approach

Introduce a new flag, --worker, whose argument has the form worker_id:(n_workers-1), that is, a zero-based worker id followed by the highest worker id. For example:

  • rclone copy ... --worker 0:1
  • rclone copy ... --worker 1:1

In the above example, two workers are deployed, and each will handle roughly 50% of the files.

How It Works

For each file to be transferred, rclone will calculate a hash based on the file's path (e.g., using MD5). Then, using the worker parameters, it determines if the current worker should process the file based on the following pseudocode:

worker_id = [provided worker id, zero-based]
n_workers = [total number of workers]

for each file in files_to_copy:
    # Interpret the MD5 digest of the path as an integer
    md5_hash = int(md5(file.path))
    # Adding worker_id selects which residue class this worker owns,
    # so every file is claimed by exactly one worker
    if (md5_hash + worker_id) % n_workers == 0:
        transfer(file)
    else:
        skip(file)
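
The pseudocode above can be sketched as runnable Python. This is only an illustration of the idea, not rclone code: the helper names `parse_worker` and `owns` are hypothetical, and it assumes paths are hashed as UTF-8 bytes and the MD5 digest is read as a big-endian integer.

```python
import hashlib


def parse_worker(spec: str) -> tuple[int, int]:
    """Parse 'worker_id:(n_workers-1)' into (worker_id, n_workers)."""
    worker_id_text, last_id_text = spec.split(":")
    worker_id = int(worker_id_text)
    n_workers = int(last_id_text) + 1  # second field is the highest worker id
    if not 0 <= worker_id < n_workers:
        raise ValueError(f"worker id out of range in {spec!r}")
    return worker_id, n_workers


def owns(path: str, worker_id: int, n_workers: int) -> bool:
    """Return True if this worker should transfer the file at `path`."""
    # Assumption: hash the UTF-8 path and treat the digest as an integer.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    md5_hash = int.from_bytes(digest, "big")
    return (md5_hash + worker_id) % n_workers == 0


if __name__ == "__main__":
    worker_id, n_workers = parse_worker("0:1")
    for path in ["a/1.txt", "a/2.txt", "b/3.txt"]:
        action = "transfer" if owns(path, worker_id, n_workers) else "skip"
        print(action, path)
```

Because exactly one worker id in 0..n_workers-1 makes the sum divisible by n_workers, each path is claimed by one worker and only one, provided every worker sees the same path strings.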
zackees changed the title from "Feature Request: Add --worker Flag for Hash-Based Partitioned Transfers in rclone" to "Feature Request: Add --worker Flag for Path-Hash-Based Partitioned Transfers in rclone" on Feb 16, 2025
Member

ncw commented Feb 17, 2025

This is a great idea. So great that I'm actually already in the middle of implementing it :-)

Here is the proposal I made - comments welcome

Hash Filter

This proposal describes a new flag --hash-filter which is used to make a deterministic selection of a random subset of files.

Uses include:

  1. Running a big sync on multiple machines
  2. Checking a subset of files for bitrot

The flag takes two parameters expressed as a fraction, so --hash-filter 1/3 for example. Here the 3 represents the total number of subsets of files and the 1 represents which subset to select. So --hash-filter 1/3, --hash-filter 2/3 and --hash-filter 3/3 will all select different non-overlapping subsets of files.

Note that rclone will still have to traverse all directories to select these files.

The first parameter can be replaced with @ to select a random subset of files. In the example above, --hash-filter @/3 means rclone will replace the @ with a random number between 1 and 3 inclusive. The value is chosen once and remains constant throughout the life of that set of filters, so any retries that are needed will use the same value.
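
To make the K/N semantics concrete, here is a minimal Python sketch. It is not rclone's implementation: the function names are hypothetical, and the choice of MD5 over the UTF-8 path and the mapping of subset K to residue K mod N are assumptions made purely for illustration.

```python
import hashlib
import random
from typing import Optional


def parse_hash_filter(spec: str, rng: Optional[random.Random] = None) -> tuple[int, int]:
    """Parse 'K/N' or '@/N' into (k, n) with 1 <= k <= n."""
    k_text, n_text = spec.split("/")
    n = int(n_text)
    if n < 1:
        raise ValueError(f"N must be at least 1 in {spec!r}")
    if k_text == "@":
        # Pick the subset once; the caller keeps (k, n) for any retries.
        k = (rng or random).randint(1, n)
    else:
        k = int(k_text)
    if not 1 <= k <= n:
        raise ValueError(f"K must be between 1 and N in {spec!r}")
    return k, n


def selected(path: str, k: int, n: int) -> bool:
    """Return True if `path` falls in subset k of n."""
    # Assumption: MD5 of the UTF-8 path, read as a big-endian integer.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Assumption: subset K owns residue K mod N, so K == N owns residue 0.
    return int.from_bytes(digest, "big") % n == k % n


if __name__ == "__main__":
    k, n = parse_hash_filter("@/3")
    for path in ["a/1.txt", "a/2.txt", "b/3.txt"]:
        print(f"{k}/{n}", "select" if selected(path, k, n) else "skip", path)
```

Parsing the flag once and reusing the resulting (k, n) pair is what keeps an @ selection stable across retries in this sketch.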
