shuffile file not restored after a rebuild #39

Closed
adammoody opened this issue Oct 24, 2023 · 0 comments · Fixed by #40

adammoody commented Oct 24, 2023

The shuffile file is not correctly restored after rebuilding lost files. It should contain a list of files owned by the rank:

RANK
    1
        FILE
            /dev/shm/user123/scr.1628391/scr.dataset.13/.scr/reddesc.er.1.redset
            /dev/shm/user123/scr.1628391/scr.dataset.13/.scr/reddesc.er.1.xor.grp_1_of_1.mem_2_of_2.redset
            /dev/shm/user123/scr.1628391/scr.dataset.13/rank_1.ckpt
        FILES = 3

But after a rebuild, it only consists of:

RANK = 0

The shuffile file is not part of the set of files protected by the redundancy encoding, so it needs to be restored independently.

It could be reconstructed by reading the list of files from the redset after redset_recover has succeeded and before the redset is deleted here:

er/src/er.c

Lines 737 to 741 in 9e88b5b

if (redset_recover(comm_world, redset_path, &d) != REDSET_SUCCESS) {
  /* rebuild failed, rc is same value across comm_world */
  rc = ER_FAILURE;
}
redset_delete(&d);

The simplest approach would be to just recreate the shuffile file on every rank (even those where it already exists) by calling shuffile_create again:

er/src/er.c

Lines 683 to 687 in 9e88b5b

/* associate list of both app files and redundancy files with calling process */
if (shuffile_create(comm_world, comm_store, count, filenames2, shuffile_file) != SHUFFILE_SUCCESS) {
  /* failed to register files with shuffile */
  rc = ER_FAILURE;
}

To do that, we need to be able to list the files in the redset. The redset_filelist_get function only returns redundancy encoding files. We will need to add a new function to redset to return the full file list: user files + redundancy files.
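As a rough sketch of what "full file list" means here, the helper below concatenates a list of user files and a list of redundancy files into one newly allocated array. The names full_filelist_build and full_filelist_free are hypothetical and exist only for illustration; they are not part of redset or er, and the eventual redset function may look quite different.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an existing redset or er function):
 * frees a list returned by full_filelist_build. */
void full_filelist_free(char** list, int count)
{
  if (list == NULL) {
    return;
  }
  for (int i = 0; i < count; i++) {
    free(list[i]);
  }
  free(list);
}

/* Hypothetical helper (not an existing redset or er function):
 * returns a newly allocated array holding copies of the user file
 * names followed by the redundancy file names. Sets *out_count to
 * the number of entries. Returns NULL if both lists are empty or
 * if an allocation fails; *out_count is 0 in either case. */
char** full_filelist_build(
  const char** user_files, int num_user,
  const char** red_files, int num_red,
  int* out_count)
{
  assert(num_user >= 0 && num_red >= 0 && out_count != NULL);
  *out_count = 0;

  int total = num_user + num_red;
  if (total == 0) {
    return NULL;
  }

  char** list = (char**) calloc((size_t) total, sizeof(char*));
  if (list == NULL) {
    return NULL;
  }

  for (int i = 0; i < total; i++) {
    /* user files come first, then redundancy files */
    const char* src = (i < num_user) ? user_files[i] : red_files[i - num_user];
    list[i] = (char*) malloc(strlen(src) + 1);
    if (list[i] == NULL) {
      /* release the i entries copied so far */
      full_filelist_free(list, i);
      return NULL;
    }
    strcpy(list[i], src);
  }

  *out_count = total;
  return list;
}
```

With something along these lines, the combined count and list could take the place of count and filenames2 in the shuffile_create call shown above, assuming the user file names can be recovered from the redset after the rebuild.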
