feat: data folder processing in datapreprocessor #417

Closed
wants to merge 8 commits into from


willmj
Collaborator

@willmj willmj commented Dec 12, 2024

Description of the change

Changes how the data preprocessor collects input files, so that a folder in data_paths is expanded into the files it contains.
Before: files = datasetconfig.data_paths
After:

            files = []
            for path in datasetconfig.data_paths:
                if os.path.isdir(path):
                    # If the path is a folder, collect all files within it
                    folder_files = [
                        os.path.join(path, file)
                        for file in os.listdir(path)
                        if os.path.isfile(os.path.join(path, file))
                    ]
                    files.extend(folder_files)
                else:
                    files.append(path)
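
For reference, the snippet above can be read as a standalone helper. This is only a sketch: the function name expand_data_paths is hypothetical and not part of the PR, and the sorted() call is an addition made here because os.listdir returns entries in arbitrary order.

```python
import os
from typing import List


def expand_data_paths(data_paths: List[str]) -> List[str]:
    """Expand folders in data_paths into the regular files directly inside them.

    Hypothetical helper mirroring the snippet in the PR description.
    Non-folder entries are passed through unchanged.
    """
    files: List[str] = []
    for path in data_paths:
        if os.path.isdir(path):
            # sorted() is an addition for this sketch: os.listdir gives no
            # ordering guarantee, and a stable order makes runs reproducible.
            for name in sorted(os.listdir(path)):
                candidate = os.path.join(path, name)
                # Skip sub-folders; only top-level files are collected.
                if os.path.isfile(candidate):
                    files.append(candidate)
        else:
            files.append(path)
    return files
```

Note that this walks the folder once up front, which is the cost the review discussion below is concerned with.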

To be rebased on top of #412

Related issue number

How to verify the PR

Run the unit tests, or run training with a data config whose data_paths points to a data folder.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Dec 12, 2024
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj willmj changed the title from "feat: [WIP] data folder processing in datapreprocessor" to "feat: data folder processing in datapreprocessor" Dec 13, 2024
@willmj willmj marked this pull request as ready for review December 13, 2024 19:32
Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

Thanks for the PR @willmj. We might need to support files/folder as per our discussion here.

Comment on lines 88 to 99
    files = []
    for path in datasetconfig.data_paths:
        if os.path.isdir(path):
            # If the path is a folder, collect all files within it
            folder_files = [
                os.path.join(path, file)
                for file in os.listdir(path)
                if os.path.isfile(os.path.join(path, file))
            ]
            files.extend(folder_files)
        else:
            files.append(path)
Collaborator

Based on this discussion and comments by @dushyantbehl, we are looking to support files/folder like this:

  • if extension is found then use that as the loader
  • if it is a folder then pass the folder directly
  • else fallback on the hf dataset id
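
The three rules above could be sketched roughly as follows. This is an illustration only, not the project's implementation: the function name classify_data_path and the EXT_TO_LOADER table are hypothetical, and the set of extensions it lists is an assumption.

```python
import os
from typing import Tuple

# Hypothetical extension-to-loader table (an assumption for this sketch).
# The loader names are the builder names used by Hugging Face datasets.
EXT_TO_LOADER = {
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".csv": "csv",
    ".txt": "text",
}


def classify_data_path(data_path: str) -> Tuple[str, str]:
    """Decide how a single data path should be handed to the loader.

    Returns a (kind, value) pair:
      ("file", loader_name)     -> a known extension was found
      ("folder", path)          -> pass the folder through untouched
      ("hf_dataset_id", path)   -> fall back to treating it as a dataset ID
    """
    ext = os.path.splitext(data_path)[1].lower()
    if ext in EXT_TO_LOADER:
        return ("file", EXT_TO_LOADER[ext])
    if os.path.isdir(data_path):
        # No os.listdir / glob here: the folder is passed on as-is.
        return ("folder", data_path)
    return ("hf_dataset_id", data_path)
```

The point of the middle branch is that the folder is never enumerated in our own code; only a single isdir check is made.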

One reason, as Ashok discussed here, is that glob.glob or os.listdir can be a performance bottleneck, because each makes an extra pass over every file in the folder. Passing the folder directly avoids that pass.

As mentioned here, datasets.load_dataset accepts a directory path directly.

Collaborator Author

@Abhishek-TAMU thanks for the explanation! I missed this thread; that makes sense. I pushed some changes so that it works when the user passes a single folder. I have some additional questions about how we plan to support this now, which might be better answered by @ashokponkumar and @dushyantbehl:

  • Are we assuming the user will only pass 1 folder if a folder is passed?
  • If not, are we assuming the user will only pass folders or files and not a combination?
  • Does our current implementation work with the HF dataset ID or does additional functionality need to be added for this?

Contributor

@willmj Thanks for the PR.
Yes, the user will pass just one folder per dataset. We will support only one (for simplicity), as HF seems to support only one.

The user has to pass either a folder or files. You can assume that if a single path is specified, it can be checked with an isfile or isdir check.

I don't see any reason why our code won't be able to handle an HF dataset ID, so supporting that would be great!
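
A minimal sketch of the constraints described in this reply, assuming they are enforced when the data config is read. The function name check_data_paths and its return labels are hypothetical, and the error cases are one possible interpretation rather than agreed behaviour.

```python
import os
from typing import List


def check_data_paths(data_paths: List[str]) -> str:
    """Validate data_paths for one dataset and report what kind of input it is.

    Hypothetical helper. Returns "folder", "files", or "hf_dataset_id",
    and raises ValueError for combinations the thread rules out.
    """
    if not data_paths:
        raise ValueError("data_paths must contain at least one entry")

    if any(os.path.isdir(p) for p in data_paths):
        # Only one folder per dataset, and never mixed with files.
        if len(data_paths) > 1:
            raise ValueError("a folder must be the only entry in data_paths")
        return "folder"

    if all(os.path.isfile(p) for p in data_paths):
        return "files"

    # Assumption: a single path that is neither a file nor a folder
    # is treated as a Hugging Face dataset ID.
    if len(data_paths) == 1:
        return "hf_dataset_id"

    raise ValueError("data_paths must be one folder, existing files, or one dataset ID")
```

For example, a config listing a folder together with a file would be rejected, while a single non-local string would fall through to the dataset-ID case.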

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
3 participants