Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

Closed
oelachqar opened this issue Feb 4, 2025 · 3 comments · Fixed by #1381
Closed

[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

oelachqar opened this issue Feb 4, 2025 · 3 comments · Fixed by #1381
Labels
Feature good first issue Good for newcomers help wanted Extra attention is needed

Comments

@oelachqar
Copy link
Contributor

oelachqar commented Feb 4, 2025

Feature request

Add Tulu3 SFT Mixture Dataset Support

Problem

Tulu3 is a high-quality instruction dataset created through careful curation and filtering of various sources.

We need to add support for the Tulu3 SFT mixture dataset to expand our training dataset options.

Tasks

[ ] Add dataset loading support for Tulu3 mixture
[ ] Create a dataset transform that handles Tulu3's conversation format
[ ] Add example training config using this dataset
[ ] Add tests for the dataset loader

Implementation Notes

  • Dataset is available on HuggingFace at allenai/tulu-3-sft-mixture
  • Should follow our existing dataset registration patterns

Getting Started

  • Look at existing dataset implementations in src/oumi/datasets/
  • Review Tulu3's data format on HuggingFace
  • Start with implementing basic dataset loading

Related: OPE-790

Motivation / references

https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B

Your contribution

If somebody can volunteer to start this work, I can answer questions and help with testing.

@oelachqar oelachqar added the good first issue Good for newcomers label Feb 4, 2025
@bwalshe
Copy link
Contributor

bwalshe commented Feb 4, 2025

I can take a look at this.

One thing that is a little confusing to me is that on the data viewer it looks like the messages entry is a list of dicts, but in the raw data it is a list of lists. (I'll put an example below, in case that doesn't make sense.)

Also, should the conversation always have two entries?

Ok, it looks like I was mistaken about the way the lists are constructed, and I can see examples where there are multiple messages in a single conversation, so I will format it accordingly.

I haven't got a real plan in place for how I am going to test, as I don't really see anything comparable for the other datasets. How would you like me to proceed?

@xrdaukar xrdaukar added the help wanted Extra attention is needed label Feb 4, 2025
@bwalshe
Copy link
Contributor

bwalshe commented Feb 5, 2025

I've created a WIP PR here #1381. The example config will need some work, but let me know what you think.

@wizeng23 wizeng23 changed the title [Feature] Add Tulu3 SFT Mixture Dataset Support [Feature] Add Tulu3 SFT/DPO Mixture Dataset Support Feb 11, 2025
@taenin
Copy link
Collaborator

taenin commented Feb 12, 2025

Looks good, thanks for the contribution @bwalshe !

@taenin taenin closed this as completed Feb 12, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Feature good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants