added some flexibility to create your custom benchmark splits #307

Merged
optimass merged 5 commits into main from custom_splits
Jan 23, 2025

optimass
Collaborator

added some flexibility to create your custom benchmark splits

@optimass changed the title from "Main" to "added some flexibility to create your custom benchmark splits" on Jan 17, 2025
Collaborator

@gasse gasse left a comment

Here are my comments:

This is an API change for users, which would ideally be avoided. The method's behavior also differs quite a bit depending on whether task_splits is provided, which is a bit confusing IMO. Why not implement a new explicit method, like subset_from_custom_splits() or subset_from_task_list()? This would solve both problems.

Otherwise LGTM. I'll let you decide whether to keep it like this or not @optimass :)

@optimass
Collaborator Author

@gasse yes, @recursix made the same suggestion :)

recursix
recursix previously approved these changes Jan 22, 2025
        if split_column not in self.task_metadata.columns:
            raise NotImplementedError(
                f"This benchmark does not provide default train/valid/test splits (missing a {repr(split_column)} column in task metadata)"
    def subset_from_split(
Collaborator

It seems like a much better fit to have two separate functions, instead of an if with two completely separate code paths.

I would keep the original subset_from_split function untouched and add a
subset_from_task_list(task_names: list[str], benchmark_name_suffix: str)
function.

But technically, you could achieve exactly this with subset_from_regex by passing a long regex. subset_from_task_list could be a wrapper that generates the regex; that might be more convenient than having to specify each task name exactly.

In your code, task_splits is only ever used as task_splits[split], so instead of creating a dict, just use a list directly.
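
For illustration, here is a minimal sketch of the wrapper idea discussed above. It is not the code merged in this PR: `ToyBenchmark` is a hypothetical stand-in holding only a name and a list of task names, and its `subset_from_regex` is a simplified assumption rather than the library's actual signature. Only the `subset_from_task_list(task_names, benchmark_name_suffix)` shape comes from the review comment.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyBenchmark:
    """Hypothetical stand-in for the real benchmark class (not the library's API)."""

    name: str
    task_names: list[str] = field(default_factory=list)

    def subset_from_regex(self, regex: str, suffix: str) -> "ToyBenchmark":
        # Keep only the tasks whose full name matches the pattern.
        pattern = re.compile(regex)
        kept = [t for t in self.task_names if pattern.fullmatch(t)]
        return ToyBenchmark(name=f"{self.name}_{suffix}", task_names=kept)

    def subset_from_task_list(
        self, task_names: list[str], benchmark_name_suffix: str
    ) -> "ToyBenchmark":
        # Reject names that are not part of this benchmark, so typos fail loudly.
        unknown = sorted(set(task_names) - set(self.task_names))
        if unknown:
            raise ValueError(f"Unknown task names: {unknown}")
        # Escape each name so characters such as "." are matched literally,
        # then join the names into a single alternation.
        regex = "|".join(re.escape(t) for t in task_names)
        return self.subset_from_regex(regex, benchmark_name_suffix)


# Made-up task names, used only to exercise the sketch.
bench = ToyBenchmark("toy", ["click.button", "click-button", "fill.form", "scroll.page"])
custom_test = bench.subset_from_task_list(["click.button", "scroll.page"], "custom_test")
```

In this sketch the subset keeps the benchmark's own task order rather than the order of the list passed in, and the split-based method is left out entirely, so the existing API would stay unchanged.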

@optimass optimass merged commit 5a879e1 into main Jan 23, 2025
13 checks passed
@optimass optimass deleted the custom_splits branch January 23, 2025 18:26
4 participants