-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
feat: data folder processing in datapreprocessor #417
Conversation
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Thanks for making a pull request! 😃 |
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @willmj. We might need to support files/folder as per our discussion here.
tuning/data/data_processors.py
Outdated
files = [] | ||
for path in datasetconfig.data_paths: | ||
if os.path.isdir(path): | ||
# If the path is a folder, collect all files within it | ||
folder_files = [ | ||
os.path.join(path, file) | ||
for file in os.listdir(path) | ||
if os.path.isfile(os.path.join(path, file)) | ||
] | ||
files.extend(folder_files) | ||
else: | ||
files.append(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on this discussion and comments by @dushyantbehl, we are looking to support files/folder like this:
- if
extension
is found then use that as theloader
- if it is a folder then pass the folder directly
- else fallback on the hf dataset id
One reason is as discussed by Ashok here that glob.glob
OR os.listdir
can be a performance bottleneck as it iterate through files in a folder once, hence we can avoid that.
As mentioned here in datasets.load_dataset, you can directly pass the directory path here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Abhishek-TAMU thanks for the explanation! I missed this thread, that makes sense. I pushed up some changes to have it work if the user is passing a single folder. I have some additional questions about how we plan to support this now, which might be better answered by @ashokponkumar and @dushyantbehl:
- Are we assuming the user will only pass 1 folder if a folder is passed?
- If not, are we assuming the user will only pass folders or files and not a combination?
- Does our current implementation work with the HF dataset ID or does additional functionality need to be added for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@willmj Thanks for the PR
Yes the user will pass just one folder per dataset we will support only 1 (for simplicity) as HF seems to support only 1.
User has to pass either a folder or files, you can assume that if a single path is specified it can be checked with a isfile
or isdir
checks
I don't see any reason why our code won't be able to handle a HF dataset ID so supporting that would be great!
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Description of the change
Changes the way data is processed:
from
files = datasetconfig.data_paths
toTo be rebased on top of #412
Related issue number
How to verify the PR
Unit tests or run training passing in a data config with a data folder as the
data_paths
Was the PR tested