Skip to content

[Vertex AI] Add responseModalities to GenerationConfig #14658

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 7 commits into from
Apr 9, 2025

Conversation

andrewheard
Copy link
Contributor

@andrewheard andrewheard commented Apr 4, 2025

Added support for setting response modalities (text or images) in GenerationConfig to enable image generation (public experimental) using Gemini 2.0 Flash (gemini-2.0-flash-exp only).

Usage Example
let vertexAI = VertexAI.vertexAI()
let model = vertexAI.generativeModel(
  modelName: "gemini-2.0-flash-exp",
  generationConfig: GenerationConfig(
    responseModalities: [.text, .image]
  )
)
let prompt = "Generate an image of a cute cartoon cat."

let response = try await model.generateContent(prompt)
for part in response.candidates.first!.content.parts {
  switch part {
  case let inlineDataPart as InlineDataPart:
    guard inlineDataPart.mimeType == "image/png" else {
      fatalError("Expected a PNG image, but got \(inlineDataPart.mimeType)")
    }
    // Handle Image
    guard let image = UIImage(data: inlineDataPart.data) else {
      fatalError("Failed to extract image from data.")
    }
    print("Received Image: \(image)")
  case let textPart as TextPart:
    // Handle Text
    print("Received Text: \(textPart.text)")
  default:
    fatalError("Received Unexpected Part Type: \(part)")
  }
}

This comment was marked as outdated.

@google-oss-bot
Copy link

1 Warning
⚠️ New public headers were added, did you remember to add them to the umbrella header?

Generated by 🚫 Danger

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the responseModalities parameter to the GenerationConfig, enabling support for multimodal responses (text and images) from the Gemini 2.0 Flash model. The changes include adding the ResponseModality type, updating the GenerationConfig struct and initializer, and adding a test case for image generation. Overall, the code is well-structured and addresses the intended functionality. However, there are a few areas that could be improved for clarity and maintainability.

Summary of Findings

  • Documentation for Public Preview Features: The documentation and comments clearly indicate the Public Preview status of the image generation feature. This is important for users to understand the limitations and potential changes associated with experimental features.
  • Error Handling in Example Usage: The example usage includes fatalError calls for unexpected scenarios. While this is acceptable for example code, consider using more robust error handling mechanisms in production code.
  • Naming Consistency: Consider renaming rawValue in ResponseModality to protoEnumValue to align with the naming convention used in other DecodableProtoEnum implementations.

Merge Readiness

The pull request is almost ready for merging. The added functionality is well-implemented and the code is generally clear. However, addressing the naming consistency issue and considering more robust error handling in the example usage would further improve the quality of the code. I am unable to directly approve this pull request, and recommend that others review and approve this code before merging. I would recommend that the naming consistency issue be addressed before merging.

@andrewheard andrewheard marked this pull request as ready for review April 7, 2025 22:13
@andrewheard andrewheard requested a review from paulb777 April 7, 2025 22:14
@andrewheard andrewheard marked this pull request as draft April 7, 2025 23:51
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the responseModalities parameter to the GenerationConfig, enabling support for multimodal responses (text and images) using the Gemini 2.0 Flash model. The changes include updates to the GenerationConfig struct, the addition of a ResponseModality enum, and a new integration test to verify image generation. Overall, the code is well-structured and addresses the intended functionality.

Summary of Findings

  • Copyright Year: The copyright year in ResponseModality.swift is set to 2025. Please confirm if this is intentional or if it should be updated to the current year.
  • Test Coverage: The new integration test generateImage includes checks for image dimensions. Consider adding more comprehensive tests to cover different scenarios and edge cases, such as handling errors or invalid image data.
  • Documentation Completeness: The documentation for ResponseModality is good, but consider adding a note about the potential cost implications of requesting image generation, as it might consume more resources than text-only responses.

Merge Readiness

The pull request is well-structured and introduces a valuable feature. However, before merging, please address the copyright year discrepancy and consider enhancing the test coverage as suggested. I am unable to directly approve this pull request, and other reviewers should review and approve this code before merging.

@andrewheard andrewheard marked this pull request as ready for review April 8, 2025 21:24
@andrewheard andrewheard merged commit 7a86f19 into main Apr 9, 2025
36 checks passed
@andrewheard andrewheard deleted the ah/vertex-multimodal-output branch April 9, 2025 00:40
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants