suggestion: implement jsonformer for generating JSON #1300

Closed
bakkot opened this issue May 3, 2023 · 11 comments
Labels
good first issue Good for newcomers

Comments

@bakkot
Contributor

bakkot commented May 3, 2023

This is a neat idea: basically, constrain the output to a particular subset of tokens so that you are guaranteed to generate data of a particular format, and also fill in other context after each piece of output automatically.

In this specific example the format is "JSON with a particular schema", and that's a good place to start, although the technique obviously generalizes.
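A minimal sketch of the token-masking part of the idea, assuming a toy vocabulary and made-up scores (none of this comes from jsonformer or llama.cpp; `constrained_greedy_pick` is a hypothetical helper): every token the format does not permit at the current position gets a score of negative infinity, so only permitted tokens can be chosen.

```python
import math


def constrained_greedy_pick(logits, allowed_ids):
    """Return the id of the highest-scoring token among allowed_ids only."""
    if not allowed_ids:
        raise ValueError("no token is allowed at this position")
    # Disallowed tokens are masked to -inf so they can never win.
    masked = [
        score if i in allowed_ids else -math.inf
        for i, score in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)


# Toy vocabulary and scores (hypothetical, not taken from a real model).
vocab = ["hello", "{", "}", '"name"', ":", "42"]
logits = [3.2, 1.1, 0.3, 2.0, 0.5, 1.7]

# Unconstrained, "hello" has the top score. A JSON object has to start
# with "{", so that is the only token allowed at the first position.
allowed = {vocab.index("{")}
choice = constrained_greedy_pick(logits, allowed)
print(vocab[choice])
```

A real sampler would apply the same mask before temperature or top-k sampling rather than taking the greedy maximum.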

@ggerganov ggerganov added the good first issue Good for newcomers label May 3, 2023
@ggerganov
Member

Great task for a llama.cpp example!

Btw, this is along the lines of the constrained Whisper sampling idea for chess moves: https://twitter.com/ggerganov/status/1640441536403116032
I think this will be another very cool example for the whisper.cpp project

@abetlen
Collaborator

abetlen commented May 3, 2023

This is something I've been working on. Using the llama.cpp Python bindings, I have constrained JSON parsing implemented, but not the full JSONSchema spec.

I wrote a custom tree-sitter parser that can parse partial JSON files and sample tokens accordingly. The tree-sitter parser generates a single C file that I believe should be easy to use in a C++ example if anyone's interested in taking that approach. Validating against the JSONSchema may be harder to do in C++; I'm not sure whether there are any good libraries.
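To illustrate the "parse partial JSON, then sample accordingly" approach, here is a rough sketch that swaps tree-sitter for a small hand-written recursive-descent checker. It is not the parser described above. It assumes a simplified JSON subset (no string escapes, no exponents, leading zeros tolerated), and the names `is_valid_prefix` and `allowed_tokens` are hypothetical.

```python
DIGITS = "0123456789"


class Incomplete(Exception):
    """Input ended early, but could still be extended into valid JSON."""


class Invalid(Exception):
    """Input can never be extended into valid JSON."""


def _skip_ws(s, i):
    while i < len(s) and s[i] in " \t\n\r":
        i += 1
    return i


def _expect(s, i, ch):
    if i >= len(s):
        raise Incomplete
    if s[i] != ch:
        raise Invalid
    return i + 1


def _digits(s, i):
    start = i
    while i < len(s) and s[i] in DIGITS:
        i += 1
    if i == start:
        raise Incomplete if i >= len(s) else Invalid
    return i


def _number(s, i):
    if s[i] == "-":
        i += 1
    i = _digits(s, i)
    if i < len(s) and s[i] == ".":
        i = _digits(s, i + 1)
    return i


def _string(s, i):
    i = _expect(s, i, '"')
    while True:
        if i >= len(s):
            raise Incomplete
        if s[i] == '"':
            return i + 1
        if s[i] == "\\":
            raise Invalid  # escapes are out of scope for this sketch
        i += 1


def _members(s, i, close, parse_item):
    """Parse comma-separated items up to the closing bracket or brace."""
    i = _skip_ws(s, i)
    if i < len(s) and s[i] == close:
        return i + 1
    while True:
        i = _skip_ws(s, parse_item(s, _skip_ws(s, i)))
        if i < len(s) and s[i] == close:
            return i + 1
        i = _expect(s, i, ",")


def _pair(s, i):
    i = _skip_ws(s, _string(s, i))
    return _value(s, _expect(s, i, ":"))


def _value(s, i):
    i = _skip_ws(s, i)
    if i >= len(s):
        raise Incomplete
    ch = s[i]
    if ch == "{":
        return _members(s, i + 1, "}", _pair)
    if ch == "[":
        return _members(s, i + 1, "]", _value)
    if ch == '"':
        return _string(s, i)
    for word in ("true", "false", "null"):
        if ch == word[0]:
            for c in word:
                i = _expect(s, i, c)
            return i
    if ch == "-" or ch in DIGITS:
        return _number(s, i)
    raise Invalid


def is_valid_prefix(s):
    """True if s is valid JSON or can be extended into valid JSON."""
    try:
        end = _value(s, 0)
    except Incomplete:
        return True
    except Invalid:
        return False
    return _skip_ws(s, end) == len(s)


def allowed_tokens(prefix, candidates):
    """Keep only the candidate tokens that leave the text a valid prefix."""
    return [t for t in candidates if is_valid_prefix(prefix + t)]


print(allowed_tokens('{"temp":', ["450", ' "hot"', "}", ",", "tr", "[1", "oops"]))
```

The surviving candidates are the ones a sampler would be allowed to choose from. Re-parsing the whole prefix for every candidate is quadratic, so an incremental parser is the more practical choice for real use.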

@ggerganov
Member

@abetlen

Looking great!
My impression is that constrained sampling is under-utilized today and there are many cool applications of this approach that are yet to be demonstrated.

@ejones
Collaborator

ejones commented May 11, 2023

#1397 looks like it could address this

@bakkot
Contributor Author

bakkot commented May 11, 2023

#1397 is related, but doesn't (currently) do what this issue is asking for.

@ggerganov
Member

Relevant:

@loretoparisi

@ggerganov
Member

#1887

@arthurwolf

I've found the docs about this and am very interested.

However, I'm really not sure how to write a grammar for generating JSON...

Does anyone have an example to provide? As JSON is given as an example of a possible thing to do in the grammar docs, it'd be great if an example of how to do that was provided.

Thanks.

@ejones
Collaborator

ejones commented Aug 24, 2023

For generating arbitrary JSON, there's a JSON grammar provided in grammars/json.gbnf:

% ./main -m $L13B -p 'The weather for today: ' --grammar-file grammars/json.gbnf
...

 The weather for today: {"temp":450, "pressure":36.0, "humidity":890}

For conforming to a JSON schema, there's examples/json-schema-to-grammar.py:

% cat ../schemas/student.json 
{
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}
% ./main -m $L13B -p 'Hermione Granger ' --grammar "$(python3 examples/json-schema-to-grammar.py ../schemas/student.json --prop-order 'is_student,name,age,courses')"
...

 Hermione Granger {"is_student":true, "name":"Hermione","age":12,"courses":[ "Arithmancy", "Defense Against the Dark Arts", "Divination", "Muggle Studies", "Herbology", "Potions" ]}
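As a sanity check on output like the above, the generated JSON can be parsed and compared against the schema. The sketch below is a hand-rolled checker that covers only the keywords used in student.json (`type`, `properties`, `items`); `conforms` is a hypothetical helper, not part of llama.cpp, and not a full JSON Schema validator.

```python
import json

# The schema and model output, copied from the example above.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {"type": "array", "items": {"type": "string"}},
    },
}

OUTPUT = (
    '{"is_student":true, "name":"Hermione","age":12,"courses":[ "Arithmancy", '
    '"Defense Against the Dark Arts", "Divination", "Muggle Studies", '
    '"Herbology", "Potions" ]}'
)


def conforms(value, schema):
    """Check value against the small schema subset used in this example."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Only declared properties that are present get checked.
        return all(conforms(value[k], sub) for k, sub in props.items() if k in value)
    if kind == "array":
        items = schema.get("items", {})
        return isinstance(value, list) and all(conforms(v, items) for v in value)
    if kind == "string":
        return isinstance(value, str)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # keywords outside this sketch are not checked


parsed = json.loads(OUTPUT)
print(conforms(parsed, SCHEMA))
print(list(parsed))  # key order, for comparison with --prop-order
```

Since `json.loads` preserves key order, the second line makes it easy to compare the generated order with the `--prop-order` argument.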

@arthurwolf

arthurwolf commented Aug 25, 2023 via email

6 participants