How much code in micro-language Foo do you actually need to train one of these?
The dataset used for the provided weights contains 60k rows. Each scraped script is split into individual functions, which is an easy and reliable way to break code into chunks, so one function = one entry.
In practice, this resulted in 762 repositories being parsed for the training data; see godot_dodo_4x_60k_repos.json.
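As a rough illustration only (not the project's actual pipeline), a minimal Python sketch of this function-level chunking might look like the following. The directory name, output field names, and the regex used to detect function definitions are all assumptions:

```python
import json
import re
from pathlib import Path

# Assumption: a top-level GDScript function starts at column 0 with "func name(".
FUNC_DEF = re.compile(r"^func\s+\w+\s*\(", re.MULTILINE)

def split_into_functions(source: str) -> list[str]:
    """Split a .gd script into one chunk per top-level function."""
    starts = [m.start() for m in FUNC_DEF.finditer(source)]
    # Each function runs from its "func" keyword to the start of the next one.
    return [source[a:b].rstrip() for a, b in zip(starts, starts[1:] + [len(source)])]

def build_rows(repo_dir: str) -> list[dict]:
    """One dataset row per function found in any .gd file under repo_dir."""
    rows = []
    for path in Path(repo_dir).rglob("*.gd"):
        for func_src in split_into_functions(path.read_text(errors="ignore")):
            rows.append({"source_file": str(path), "code": func_src})
    return rows

if __name__ == "__main__":
    # "scraped_repos" is a hypothetical directory of cloned repositories.
    rows = build_rows("scraped_repos")
    with open("dataset.json", "w") as f:
        json.dump(rows, f, indent=2)
    print(f"{len(rows)} rows written")
```

This sketch only demonstrates the one-function-equals-one-entry idea; the real dataset generation may handle nested functions, comments, and instruction pairing differently.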
I would say the lower bound of dataset sizes I've seen for LLaMA finetunes in general (not code-specific ones) sits around 15-20k rows.
I personally trained a 7B model on 20k rows initially to judge whether this project was worth pursuing, but I don't have any evaluations for that one. Still, it showed good enough results to continue, so that is the sort of minimum I'd be looking at.