This specification describes the functional requirements, architecture considerations, data handling details, error handling approaches, and a testing plan for creating a single-page web app that takes an image, obtains bounding box data from an AI (LLM or computer vision model), and validates the returned data against a custom JSON Schema.
You want a single-page HTML web application (SPA) that:
- Lets a user upload or provide an image.
- Sends the image to one or more Large Language Models (LLMs) or computer vision services (collectively, "the AI") for analysis.
- Receives a JSON response describing bounding boxes within the image.
- Validates the JSON response against a custom bounding box schema (which enforces normalized coordinates, mandatory labels with confidence scores, etc.).
- Displays the validation results (pass/fail, any errors) and potentially the bounding box overlays on the image for manual inspection.
- Supports multiple AI models (OpenAI GPT-4o, Google Gemini, Anthropic Claude 3).
- Creates heat map visualizations that aggregate results from multiple AI queries.
- Goal: Evaluate and compare how different LLMs/CV models perform image analysis tasks by returning bounding boxes (along with descriptive labels).
- Scenario: The user uploads an image (e.g., "excited dog running in grass"). The system calls the AI, which returns bounding box data in a JSON structure conforming to the schema below, then checks for schema validity and presents the results.
- Comparison: The app generates multiple responses from the same AI model to analyze consistency and renders an aggregated heat map visualization.
-
Single-Page Interface
- A minimal HTML page with embedded JavaScript.
- File/image upload functionality (or drag-and-drop).
- A place to show the validated JSON output and/or any error messages.
- Model selector dropdown for choosing different AI providers.
- API key input field that stores keys locally in the browser.
- Response selector to view individual AI responses.
-
Image Upload and AI Request
- The user selects an image (JPEG, PNG, etc.).
- The app sends this image (or a reference/URL to it) to an AI service endpoint.
- Supports OpenAI GPT-4o, Google Gemini, and Anthropic Claude 3 models.
- Makes multiple queries (5 by default) to the same model for consistency analysis.
-
JSON Schema Validation
- The AI responds with JSON describing bounding boxes.
- The app validates that JSON against the agreed-upon schema using Ajv.
- The result of the validation is displayed to the user (success, or a list of validation errors).
-
Custom Bounding Box Schema
- Must require
width
,height
,bounding_boxes
andcontextual_attributes
inimage
. width
/height
: integers.bounding_boxes
: an array of 0 or more bounding box objects.- Each bounding box has
x_min
,y_min
,x_max
,y_max
as integer pixel coordinates. - Each box must have a non-empty
labels
array.- Each label has
name
(string) andconfidence
(integer in[0, 100]
).
- Each label has
- Each bounding box has
contextual_attributes
: an array of objects withattribute
anddescription
strings.
- Must require
-
Visualization Features
- Bounding boxes displayed as overlays on the image with different colors.
- Heat map visualization showing aggregated results from multiple queries.
- Ability to switch between individual AI responses for comparison.
- Labels displayed next to bounding boxes with confidence scores.
-
Flexibility and Extensibility
- The schema allows additional properties beyond those required.
- The user might pass extra data, but the core required fields must conform to the schema.
-
API Key Management
- API keys stored locally in browser's localStorage.
- Different key fields for each supported AI provider.
- Dynamic labeling based on selected model.
A Front-End Only SPA that:
-
HTML + JavaScript:
- Renders a form for selecting an image and AI model.
- Manages API keys for different providers.
- On form submission, calls the selected AI service.
-
AI Integration:
- Uses the LangChain library for standardized integration with multiple AI providers.
- Sends the image as base64 to the selected AI model.
- Makes multiple queries (default: 5) for consistency analysis.
- Receives JSON responses with bounding box data.
-
JSON Validation:
- Uses Ajv for JSON Schema validation.
- Validates the AI's JSON against the custom bounding box schema.
- Displays validation results.
-
Visualization:
- Draws bounding boxes with labels on a canvas overlay.
- Generates heat map visualizations showing consistency across multiple responses.
- Applies color coding for different object types.
-
User Action
- The user selects an AI model and provides their API key.
- The user selects an image file from their device.
-
Front-End Processing
- Converts image to base64 and sends it to the selected AI service.
- Makes multiple identical queries to analyze consistency.
-
AI Response
The AI returns JSON structures that should conform to this schema:
{
"image": {
"width": 1024,
"height": 768,
"bounding_boxes": [
{
"x_min": 200,
"y_min": 100,
"x_max": 400,
"y_max": 300,
"labels": [
{ "name": "dog", "confidence": 95 },
{ "name": "running", "confidence": 80 }
]
}
// ... more bounding boxes ...
],
"contextual_attributes": [
{ "attribute": "weather", "description": "Sunny day" },
{ "attribute": "location", "description": "Public park" }
]
}
}
-
Validation
- Uses the JSON Schema to validate each response.
- Filters out invalid responses.
-
Visualization
- Renders individual bounding boxes for each valid response.
- Generates a heat map overlay showing agreement across multiple queries.
- Allows switching between individual responses.
-
AI Call Failures
- If the AI endpoint is unreachable or returns an error code, skips that response.
- Continues processing other successful responses.
-
Malformed JSON
- Attempts to clean JSON from markdown code blocks if needed.
- If the response is not valid JSON, skips that response.
-
Schema Validation Failures
- If the JSON doesn't validate against the schema, logs errors.
- Only processes valid responses for visualization.
-
Partial Success Strategy
- If some queries succeed and others fail, works with the successful ones.
- Shows a count of successful vs. total queries.
The implemented schema includes support for bounding boxes and contextual attributes:
{
"$id": "https://example.com/my-bbox-schema.json",
"type": "object",
"title": "Image Bounding Box Schema",
"properties": {
"image": {
"type": "object",
"properties": {
"width": { "type": "integer", "minimum": 1 },
"height": { "type": "integer", "minimum": 1 },
"bounding_boxes": {
"type": "array",
"items": {
"type": "object",
"properties": {
"x_min": { "type": "integer", "minimum": 0 },
"y_min": { "type": "integer", "minimum": 0 },
"x_max": { "type": "integer", "minimum": 0 },
"y_max": { "type": "integer", "minimum": 0 },
"labels": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"confidence": { "type": "integer", "minimum": 0, "maximum": 100 }
},
"required": ["name", "confidence"],
"additionalProperties": true
}
}
},
"required": ["x_min", "y_min", "x_max", "y_max", "labels"],
"additionalProperties": true
}
},
"contextual_attributes": {
"type": "array",
"items": {
"type": "object",
"properties": {
"attribute": { "type": "string" },
"description": { "type": "string" }
},
"required": ["attribute", "description"],
"additionalProperties": true
}
}
},
"required": ["width", "height", "bounding_boxes", "contextual_attributes"],
"additionalProperties": true
}
},
"required": ["image"],
"additionalProperties": true
}
- Using Ajv for JSON Schema validation with specific options to handle potential issues:
const ajv = new Ajv({ strict: false, strictSchema: false, strictTypes: false, validateSchema: false });
The application implements a heat map visualization that:
- Makes multiple queries (5 by default) to the same AI model.
- Aggregates responses into a grid-based heat map.
- Applies a Gaussian blur for smoother visualization.
- Uses different colors for different object labels.
- Shows transparency based on confidence/agreement level.
This helps visualize:
- Areas where the AI consistently detects objects
- Differences in boundary precision across queries
- Confidence levels for different parts of the image
The application supports these AI models:
- OpenAI GPT-4o - Using the ChatOpenAI class from LangChain
- Google Gemini - Using the ChatGoogleGenerativeAI class from LangChain
- Anthropic Claude 3 - Using the ChatAnthropic class from LangChain
Each model requires its own API key, which is:
- Stored securely in the browser's localStorage
- Not transmitted except to the respective AI provider
- Referenced by a unique key based on the model name
-
Testing Considerations
- Test with various image types and sizes
- Compare responses across different AI models
- Validate heat map generation with different sample counts
-
Potential Enhancements
- Save/export analyzed images with bounding boxes
- Add additional AI model providers
- Include confidence thresholds for filtering results
- Implement side-by-side comparison between models
- Add annotation tools to correct or refine AI-generated boxes