# Multimodal Vision Support

Zap supports multimodal content, allowing you to send images and other visual inputs to vision-capable LLMs. This guide covers how to use these features.
## Prerequisites

- A vision-capable model (e.g., `anthropic/claude-sonnet-4-20250514`, `gpt-4o`)
- Images accessible via URL or base64-encoded
## Basic Usage

### Sending Images with Tasks

Use `TextContent` and `ImageContent` to create multimodal tasks:
```python
from zap_ai import Zap, ZapAgent, TextContent, ImageContent

# Create a vision-capable agent
vision_agent = ZapAgent(
    name="VisionAgent",
    prompt="You are a helpful assistant that can analyze images.",
    model="anthropic/claude-sonnet-4-20250514",  # Vision-capable model
)

zap = Zap(agents=[vision_agent])
await zap.start()

# Send an image with your task
task = await zap.execute_task(
    agent_name="VisionAgent",
    task_content=[
        TextContent(text="What do you see in this image?"),
        ImageContent.from_url("https://example.com/photo.jpg"),
    ],
)
```
### Image Sources

Images can be provided from two sources:

**URL** (recommended for remote images):

```python
img = ImageContent.from_url("https://example.com/image.png")
```
**Base64** (for local images):

```python
import base64

# Read and encode a local image
with open("local_image.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

img = ImageContent.from_base64(image_data, mime_type="image/png")
```
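If you encode local files regularly, a small helper can pair the base64 payload with a MIME type guessed from the file extension. This sketch uses only the standard library; `encode_image` is a hypothetical helper for illustration, not part of Zap:

```python
import base64
import mimetypes


def encode_image(path: str) -> tuple[str, str]:
    """Return (base64_data, mime_type) for a local image file."""
    # mimetypes guesses from the extension; reject anything non-image
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path}")
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return data, mime_type


# Hypothetical usage with the API shown above:
# data, mime = encode_image("local_image.png")
# img = ImageContent.from_base64(data, mime_type=mime)
```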
### Multiple Images

You can include multiple images in a single task:

```python
task = await zap.execute_task(
    agent_name="VisionAgent",
    task_content=[
        TextContent(text="Compare these two images:"),
        ImageContent.from_url("https://example.com/image1.jpg"),
        ImageContent.from_url("https://example.com/image2.jpg"),
    ],
)
```
## Model Capability Validation

Zap automatically validates that your agent's model supports vision before processing image content. If you try to send images to a non-vision model, you'll get a `VisionNotSupportedError`:

```python
from zap_ai import VisionNotSupportedError

# This agent uses a non-vision model
text_agent = ZapAgent(
    name="TextAgent",
    prompt="You are a text-only assistant.",
    model="gpt-3.5-turbo",  # Does NOT support vision
)

try:
    task = await zap.execute_task(
        agent_name="TextAgent",
        task_content=[
            TextContent(text="Describe this:"),
            ImageContent.from_url("https://example.com/photo.jpg"),
        ],
    )
except VisionNotSupportedError as e:
    print(f"Model doesn't support vision: {e}")
```
## Advanced Options

### Image Detail Level

For supported models, you can specify the detail level for image processing:

```python
# High detail for detailed analysis
img = ImageContent.from_url(
    "https://example.com/document.png",
    detail="high",  # Options: "auto", "low", "high"
)

# Low detail for faster processing
img = ImageContent.from_url(
    "https://example.com/thumbnail.png",
    detail="low",
)
```
### Explicit MIME Type

For cloud storage URLs (like `gs://`), you may need to specify the format explicitly:

```python
img = ImageContent(
    url="gs://my-bucket/image.webp",
    format="image/webp",
)
```
## Working with Messages

The `Message` class provides helper methods for multimodal content:

```python
from zap_ai.llm.message_types import Message

# Create a user message with images
msg = Message.user_with_images(
    "What's in this image?",
    [ImageContent.from_url("https://example.com/photo.jpg")],
)

# Check if a message has images
if msg.has_images():
    print("Message contains images")

# Extract text content (ignoring images)
text = msg.get_text_content()
```
## Sub-Agent Communication

When using sub-agents, you can pass multimodal content through the `message_agent` tool:

```python
# The parent agent can pass images to sub-agents
# via the message_agent tool with multimodal content
coordinator = ZapAgent(
    name="Coordinator",
    prompt="You coordinate with specialized vision agents.",
    model="anthropic/claude-sonnet-4-20250514",
    sub_agents=["ImageAnalyzer"],
)
```

The sub-agent will receive the full multimodal content, including any images.
## Best Practices

- **Use vision-capable models**: Check LiteLLM's documentation for supported vision models.
- **Prefer URLs for remote images**: Base64 encoding significantly increases payload size.
- **Consider detail levels**: Use `"low"` for thumbnails and `"high"` for documents requiring text extraction.
- **Handle errors gracefully**: Catch `VisionNotSupportedError` when model support is uncertain.
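The last bullet's fallback pattern can be sketched end to end. To keep the snippet runnable on its own, the classes and `execute_task` below are minimal stand-ins for their zap_ai counterparts; in real code you would catch `VisionNotSupportedError` raised by `zap.execute_task` in the same way:

```python
import asyncio


class VisionNotSupportedError(Exception):
    """Stand-in for zap_ai.VisionNotSupportedError."""


class TextContent:
    def __init__(self, text: str):
        self.text = text


class ImageContent:
    def __init__(self, url: str):
        self.url = url


async def execute_task(agent_name: str, task_content: list):
    # Stand-in: simulate a text-only model rejecting image content.
    if any(isinstance(c, ImageContent) for c in task_content):
        raise VisionNotSupportedError(f"{agent_name}'s model does not support vision")
    return f"{agent_name} handled {len(task_content)} content part(s)"


async def run_with_fallback(agent_name: str, content: list):
    """Try the full task; if the model lacks vision, retry with images stripped."""
    try:
        return await execute_task(agent_name, content)
    except VisionNotSupportedError:
        text_only = [c for c in content if not isinstance(c, ImageContent)]
        return await execute_task(agent_name, text_only)


result = asyncio.run(run_with_fallback("TextAgent", [
    TextContent("Describe this:"),
    ImageContent("https://example.com/photo.jpg"),
]))
print(result)  # the image is dropped and the text-only task succeeds
```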
## Example

See the complete example in `examples/multimodal_vision/main.py`:

```shell
cd examples/multimodal_vision
python main.py
```