
Enable valid OpenAI response_format specification #1069

Open

wants to merge 1 commit into develop
Conversation

liamcripwell

The current version of the OpenAI API expects the response_format to be specified as an object containing a "type" attribute, e.g. {"type": "<type>"}. However, distilabel enforces a string representation, which leads to either an error or a silent failure.
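For reference, this is the shape the raw OpenAI client expects (a minimal sketch; the model and prompt are just placeholders):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}],
    response_format={"type": "json_object"},  # object form expected by the API
)
print(completion.choices[0].message.content)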

For example, when using a TextGeneration task under the existing codebase:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

text_gen = TextGeneration(
    llm=OpenAILLM(
        model="gpt-4o",
        generation_kwargs={
            "response_format": "json"  # string form, rejected by the OpenAI API
        },
    )
)
text_gen.load()

output = next(
    text_gen.process(
        [{"instruction": "Convert this info to a JSON: John Smith is 30 years old."}]
    )
)

The OpenAI API will fail and yield BadRequestError: Error code: 400 - {'error': {'message': "Invalid type for 'response_format': expected an object, but got a string instead.", 'type': 'invalid_request_error', 'param': 'response_format', 'code': 'invalid_type'}}.

The same happens when calling generation directly on the LLM:

from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
)

llm.load()

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

Presumably the same happens for requests to the batch API, which ultimately leads to AssertionError: No output file ID was found in the batch.

from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=2,  # poll for results every 2 seconds
)

llm.load()
output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

This PR simply wraps the string representation of the specified response_format inside the object expected by OpenAI.
I have also added the same value checking that is done in agenerate() to offline_batch_generate().
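A minimal sketch of the wrapping this PR performs (the helper name is hypothetical; the actual change lives inside the OpenAILLM generation methods):

def _to_response_format(response_format: str) -> dict:
    # Hypothetical helper: validate the string shorthand the same way
    # agenerate() does, then wrap it in the object the OpenAI API expects.
    if response_format not in ("text", "json"):
        raise ValueError(
            f"Invalid response format '{response_format}'. Must be either 'text' or 'json'."
        )
    # "json" maps to OpenAI's "json_object" type
    return {"type": "json_object" if response_format == "json" else "text"}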

@plaguss
Contributor

plaguss commented Nov 25, 2024

Hi @liamcripwell, thanks for the PR! This bug was found and has already been fixed in develop: updated agenerate method. The next release will include this fix.

@liamcripwell
Author

Hi @plaguss, great to hear it's already been fixed. Sorry, I didn't notice this change in develop.

However, I still think the docstring for agenerate should be further updated, because it still says that response_format must be either "text" or "json". This is no longer true: the method now only accepts a dictionary and will fail pydantic validation if a string is provided.
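With the fix in develop, the earlier example would be written like this (a sketch, assuming the dict is passed through verbatim to the OpenAI client):

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format={"type": "json_object"},  # dict form now required
)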
