Update dataset format #85

Open
sjmonson opened this issue Jan 23, 2025 · 0 comments


There are a few minor changes that we should make to the dataset JSONL schema.

  • tok_output_length should be removed as it is redundant with output_tokens.
  • tok_input_length should be renamed to input_tokens for consistency.
  • system_prompt should be made optional.
  • We should have a field to specify the tokenizer used (possibly in the metadata).
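
As a rough illustration of what a per-record migration could look like, here is a minimal sketch. It assumes each JSONL line is a flat JSON object, and it assumes the tokenizer name lives under a `metadata` object with a `tokenizer` key; both of those names, and the `migrate_record` helper itself, are hypothetical and not decided in this issue.

```python
from typing import Any, Dict, Optional


def migrate_record(record: Dict[str, Any], tokenizer: Optional[str] = None) -> Dict[str, Any]:
    """Return a copy of `record` converted to the proposed schema (hypothetical helper)."""
    new = dict(record)  # shallow copy so the caller's record is left untouched

    # tok_output_length is redundant with output_tokens, so drop it.
    new.pop("tok_output_length", None)

    # Rename tok_input_length to input_tokens, keeping an existing input_tokens if present.
    if "tok_input_length" in new:
        old_value = new.pop("tok_input_length")
        new.setdefault("input_tokens", old_value)

    # system_prompt becomes optional: omit it when it is missing, None, or empty.
    if not new.get("system_prompt"):
        new.pop("system_prompt", None)

    # Assumption: the tokenizer is recorded as metadata["tokenizer"].
    if tokenizer is not None:
        metadata = dict(new.get("metadata") or {})
        metadata.setdefault("tokenizer", tokenizer)
        new["metadata"] = metadata

    return new
```

Applied line by line with `json.loads` and `json.dumps`, this would rewrite an existing dataset file. Whether the tokenizer belongs on every record or once in file-level metadata is still an open question above.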