
Enable valid OpenAI response_format specification #1069

Open

wants to merge 1 commit into develop
Conversation

liamcripwell

The current version of the OpenAI API expects the response_format to be specified as an object containing a "type" attribute, e.g. {"type": "<type>"}. However, distilabel enforces a string representation, which leads to either an error or a silent failure.
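For reference, this is the shape the raw OpenAI client expects (a minimal sketch; the model and prompt are just placeholders):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}],
    response_format={"type": "json_object"},  # object form expected by the API
)
print(completion.choices[0].message.content)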

For example, when using a TextGeneration task under the existing codebase:

from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

text_gen = TextGeneration(
    llm=OpenAILLM(
        model="gpt-4o",
        generation_kwargs={
            "response_format": "json"  # string form, rejected by the OpenAI API
        },
    )
)
text_gen.load()

output = next(
    text_gen.process(
        [{"instruction": "Convert this info to a JSON: John Smith is 30 years old."}]
    )
)

The OpenAI API will fail and yield BadRequestError: Error code: 400 - {'error': {'message': "Invalid type for 'response_format': expected an object, but got a string instead.", 'type': 'invalid_request_error', 'param': 'response_format', 'code': 'invalid_type'}}.

The same happens when calling generation directly on the LLM:

from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
)

llm.load()

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

Presumably the same happens for requests to the batch API, which ultimately leads to AssertionError: No output file ID was found in the batch.

from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    use_offline_batch_generation=True,
    offline_batch_generation_block_until_done=2,  # poll for results every 2 seconds
)

llm.load()
output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format="json"
)

This PR simply wraps the string representation of the specified response_format inside the object expected by OpenAI.
I have also added the same value checking that is done in agenerate() to offline_batch_generate().
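A minimal sketch of the wrapping this PR performs (the helper name is hypothetical; the actual change lives inside the OpenAILLM generation methods):

def _to_response_format(response_format: str) -> dict:
    # Hypothetical helper: validate the string shorthand the same way
    # agenerate() does, then wrap it in the object the OpenAI API expects.
    if response_format not in ("text", "json"):
        raise ValueError(
            f"Invalid response format '{response_format}'. Must be either 'text' or 'json'."
        )
    # "json" maps to OpenAI's "json_object" type
    return {"type": "json_object" if response_format == "json" else "text"}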

@plaguss
Contributor

plaguss commented Nov 25, 2024

Hi @liamcripwell, thanks for the PR! This bug was found and has already been fixed in develop: updated agenerate method. The next release will include this fix.

@liamcripwell
Author

Hi @plaguss, great to hear it's already been fixed. Sorry, I didn't notice this change in develop.

However, I still think the docstring for agenerate should be further updated, because it still says that response_format must be either "text" or "json". This is no longer true: the method now only accepts a dictionary and will fail pydantic validation if a string is provided.
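With the fix in develop, the earlier example would be written like this (a sketch, assuming the dict is passed through verbatim to the OpenAI client):

output = llm.generate_outputs(
    inputs=[[{"role": "user", "content": "Convert this info to a JSON: John Smith is 30 years old."}]],
    response_format={"type": "json_object"},  # dict form now required
)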
