How to utilize vision of LLM? #159
Comments
Hi, I tested the following code and it works both with a url and base64 images. As always, you can use OpenAI code to encode an image to base64:

import base64

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# Path to your image
image_path = "image.jpeg"

# Getting the Base64 string
base64_image = encode_image(image_path)

And then you can pass the base64 image or a url:

from agents import Agent, Runner, ModelSettings

agent = Agent(
name="Assistant",
model="gpt-4o-mini",
model_settings=ModelSettings(temperature=0.4, max_tokens=1024),
instructions="Given an input image you will generate the description of the image in the style specified by the user."
)
result = await Runner.run(agent, input=[
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe this image with a haiku."},
            {
                "type": "input_image",
                "image_url": f"data:image/jpeg;base64,{base64_image}",  # or your url "https://..."
            },
        ],
    }
])
print(result.final_output)

I don't know if this is the best available method, but I hope this may be useful.
@DanieleMorotti It works. Thanks!
I still have a question about whether an MCP server can return an image as the result. I want to let gpt-4o see that image; how can I do this?
I first tried another approach, but the only idea that comes to my mind is to pass a function as tool_use_behavior:

from agents import Agent, Runner, ModelSettings, ToolsToFinalOutputResult

def stop_at_tool(context, tools_resp):
    # Return the first tool call's output directly as the final output of the run
    res = tools_resp[0].output
    return ToolsToFinalOutputResult(is_final_output=True, final_output=res)

mn_agent = Agent(
name="Image descriptor agent",
model="gpt-4o",
instructions="You have to describe the image requested by the user",
model_settings=ModelSettings(temperature=0.3, max_tokens=2048),
mcp_servers=[mcp_server],
tool_use_behavior=stop_at_tool
)

Then, you may check the output returned by the MCP server and append to the chat history a new message with the returned image, such that you can correctly send the image to the LLM.
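A minimal sketch of that second step, assuming the MCP tool returned a base64-encoded JPEG string and that first_result is the RunResult from the run above (both the name first_result and the prompt text are placeholders, not part of the original code):

from agents import Runner

# Assumption: the run stopped at the tool, so final_output holds the
# base64-encoded image data returned by the MCP server.
image_b64 = first_result.final_output

# Rebuild the conversation so far and append a new user message carrying the image.
new_input = first_result.to_input_list() + [
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe the image returned by the tool."},
            {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"},
        ],
    }
]

# For this second pass you may prefer an agent without the tool-stopping behavior,
# so the model describes the image instead of stopping at another tool call.
second_result = await Runner.run(mn_agent, input=new_input)
print(second_result.final_output)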
Is there an API in the OpenAI SDK that facilitates seamless interaction between two different agents, specifically a vision model and a language model, where the handoff between them is straightforward? How can I effectively communicate during this handoff whether I will be using the vision model or the large language model? Without such a mechanism, the input may fail when sent to the vision model if it is not appropriately formatted. This is particularly important if I do not implement the mechanism you described earlier. @DanieleMorotti
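For reference, the Agents SDK does expose a handoffs parameter on Agent for routing between agents. A minimal, hypothetical sketch of wiring a triage agent to a vision agent follows; the agent names, instructions, and the reuse of base64_image from the earlier snippet are illustrative assumptions, and this alone does not address the input-formatting concern:

from agents import Agent, Runner

vision_agent = Agent(
    name="Vision agent",
    model="gpt-4o",
    instructions="Describe or answer questions about any image in the conversation.",
)

triage_agent = Agent(
    name="Triage agent",
    model="gpt-4o-mini",
    instructions="If the conversation contains an image, hand off to the vision agent; otherwise answer directly.",
    handoffs=[vision_agent],
)

result = await Runner.run(
    triage_agent,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this picture?"},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{base64_image}"},
            ],
        }
    ],
)
print(result.final_output)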
The provided code gives me an error when I run the following:

import asyncio

from openai import AsyncOpenAI
from agents import (
    Agent,
    Runner,
    set_default_openai_api,
    set_default_openai_client,
    set_tracing_disabled,
)

# IMAGE_URL: an image URL or data URI defined elsewhere in the script

async def main():
    # Point the SDK at a local Ollama server through an OpenAI-compatible client
    base_agent = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    set_default_openai_client(base_agent)

    agent = Agent(
        name="Assistant",
        instructions="You are a helpful assistant.",
        model="gemma3:4b-it-qat",
    )

    result = await Runner.run(
        agent,
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": "Can you tell me something about this image?",
                    },
                    {"type": "input_image", "image_url": IMAGE_URL},
                ],
            },
        ],
    )

    print(result.final_output)

if __name__ == "__main__":
    set_tracing_disabled(True)
    set_default_openai_api("chat_completions")
    asyncio.run(main())
@therjawaji you're right, there's an error: the code tries to access that key even if it doesn't exist. I opened a PR to fix that.
Question
How can I utilize the vision capability of an LLM with the OpenAI Agents SDK?
The API should support specifying an image URL, a local image path, or a base64 string. The documentation seems to be lacking this feature.
A code example of "Describe uploaded image" would be appreciated.