You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Tulu3 is a high-quality instruction dataset created through careful curation and filtering of various sources.
We need to add support for the Tulu3 SFT mixture dataset to expand our training dataset options.
Tasks
[ ] Add dataset loading support for Tulu3 mixture
[ ] Create a dataset transform that handles Tulu3's conversation format
[ ] Add example training config using this dataset
[ ] Add tests for the dataset loader
Implementation Notes
Dataset is available on HuggingFace at allenai/tulu-3-sft-mixture
Should follow our existing dataset registration patterns
Getting Started
Look at existing dataset implementations in src/oumi/datasets/
One thing that is a little confusing to me is that on the data viewer it looks like the messages entry is a list of dicts, but in the raw data it is a list of lists. (I'll put an example below, in case that doesn't make sense.)
Also, should the conversation always have two entries?
Ok, it looks like I was mistaken about the way the lists are constructed, and I can see examples where there are multiple messages in a single conversation, so I will format it accordingly.
I haven't got a real plan in place for how I am going to test, as I don't really see anything comparable for the other datasets. How would you like me to proceed?
Feature request
Add Tulu3 SFT Mixture Dataset Support
Problem
Tulu3 is a high-quality instruction dataset created through careful curation and filtering of various sources.
We need to add support for the Tulu3 SFT mixture dataset to expand our training dataset options.
Tasks
[ ] Add dataset loading support for Tulu3 mixture
[ ] Create a dataset transform that handles Tulu3's conversation format
[ ] Add example training config using this dataset
[ ] Add tests for the dataset loader
Implementation Notes
allenai/tulu-3-sft-mixture
Getting Started
src/oumi/datasets/
Related: OPE-790
Motivation / references
https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
Your contribution
If somebody can volunteer to start this work, I can answer questions and help with testing.
The text was updated successfully, but these errors were encountered: