[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

oelachqar · 2025-02-04T05:23:02Z

Feature request

Add Tulu3 SFT Mixture Dataset Support

Problem

Tulu3 is a high-quality instruction dataset created through careful curation and filtering of various sources.

We need to add support for the Tulu3 SFT mixture dataset to expand our training dataset options.

Tasks

[ ] Add dataset loading support for Tulu3 mixture
[ ] Create a dataset transform that handles Tulu3's conversation format
[ ] Add example training config using this dataset
[ ] Add tests for the dataset loader

Implementation Notes

Dataset is available on HuggingFace at allenai/tulu-3-sft-mixture
Should follow our existing dataset registration patterns

Getting Started

Look at existing dataset implementations in src/oumi/datasets/
Review Tulu3's data format on HuggingFace
Start with implementing basic dataset loading

Related: OPE-790

Motivation / references

https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B

Your contribution

If somebody can volunteer to start this work, I can answer questions and help with testing.

The text was updated successfully, but these errors were encountered:

bwalshe · 2025-02-04T10:54:22Z

I can take a look at this.

One thing that is a little confusing to me is that on the data viewer it looks like the messages entry is a list of dicts, but in the raw data it is a list of lists. (I'll put an example below, in case that doesn't make sense.)

~~Also, should the conversation always have two entries?~~

Ok, it looks like I was mistaken about the way the lists are constructed, and I can see examples where there are multiple messages in a single conversation, so I will format it accordingly.

I haven't got a real plan in place for how I am going to test, as I don't really see anything comparable for the other datasets. How would you like me to proceed?

bwalshe · 2025-02-05T10:52:08Z

I've created a WIP PR here #1381. The example config will need some work, but let me know what you think.

taenin · 2025-02-12T18:05:27Z

Looks good, thanks for the contribution @bwalshe !

oelachqar added the good first issue Good for newcomers label Feb 4, 2025

xrdaukar added the help wanted Extra attention is needed label Feb 4, 2025

wizeng23 added the Feature label Feb 7, 2025

wizeng23 mentioned this issue Feb 7, 2025

[Feature][Config] Add Tulu3/Olmo2 model configs #1407

Open

wizeng23 changed the title ~~[Feature] Add Tulu3 SFT Mixture Dataset Support~~ [Feature] Add Tulu3 SFT/DPO Mixture Dataset Support Feb 11, 2025

taenin mentioned this issue Feb 12, 2025

[Feature] Add Tulu3 SFT Mixture Dataset Support #1381

Merged

4 tasks

taenin closed this as completed Feb 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

oelachqar commented Feb 4, 2025 •

edited

Loading

bwalshe commented Feb 4, 2025 •

edited

Loading

bwalshe commented Feb 5, 2025

taenin commented Feb 12, 2025

[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

[Feature] Add Tulu3 SFT/DPO Mixture Dataset Support #1361

Comments

oelachqar commented Feb 4, 2025 • edited Loading

Feature request

Add Tulu3 SFT Mixture Dataset Support

Problem

Tasks

Implementation Notes

Getting Started

Motivation / references

Your contribution

bwalshe commented Feb 4, 2025 • edited Loading

bwalshe commented Feb 5, 2025

taenin commented Feb 12, 2025

oelachqar commented Feb 4, 2025 •

edited

Loading

bwalshe commented Feb 4, 2025 •

edited

Loading