Model Overview
Grok-2-Vision is xAI's multimodal model capable of understanding both text and images, designed for comprehensive visual analysis and reasoning tasks.
Key Features
- High intelligence (3/4 dots rating)
- Medium speed (3/5 lightning bolts rating)
- 8,192-token context window
- Medium max output tokens (estimated 4,096)
- 2024 knowledge cutoff (estimated)
- Text and image input support
- Text output support
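The context window and output figures above imply a simple budgeting rule. The sketch below assumes that prompt tokens (text and images) and generated tokens share the same 8,192-token window, and it treats the 4,096 output cap as the estimate it is; the function name `max_output_budget` is hypothetical and not part of any xAI SDK.

```python
CONTEXT_WINDOW = 8_192          # tokens, from the feature list
ESTIMATED_MAX_OUTPUT = 4_096    # tokens, an estimate per the feature list


def max_output_budget(prompt_tokens: int) -> int:
    """Return how many output tokens can still be requested.

    Assumption: prompt and output tokens count against one shared window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    # Never exceed the estimated output cap, never go below zero.
    return max(0, min(remaining, ESTIMATED_MAX_OUTPUT))
```

With a 6,000-token prompt, for instance, only the remaining 2,192 tokens would be available for the response under this assumption.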
Technical Specifications
- Pricing: $2.00 per 1M tokens (text input), $2.00 per 1M tokens (image input), $10.00 per 1M tokens (output)
- Supports: Input: text and images (JPG/JPEG, PNG, max 10MiB per image); Output: text only
- Features: Vision understanding, multimodal reasoning, image analysis
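As a rough illustration of the pricing listed above, the following sketch estimates the cost of one request from its token counts. The rates are copied from the pricing line; the function name `estimate_cost` and the idea that token counts are known in advance (for example, from a usage report) are assumptions made for illustration.

```python
from decimal import Decimal

# USD per 1M tokens, as listed under Technical Specifications.
PRICE_PER_MILLION = {
    "text_input": Decimal("2.00"),
    "image_input": Decimal("2.00"),
    "output": Decimal("10.00"),
}


def estimate_cost(text_input_tokens: int, image_input_tokens: int,
                  output_tokens: int) -> Decimal:
    """Estimate the USD cost of a single request (hypothetical helper)."""
    counts = {
        "text_input": text_input_tokens,
        "image_input": image_input_tokens,
        "output": output_tokens,
    }
    if any(n < 0 for n in counts.values()):
        raise ValueError("token counts must be non-negative")
    total = sum(PRICE_PER_MILLION[kind] * n for kind, n in counts.items())
    return total / Decimal(1_000_000)
```

Decimal is used instead of float so that small per-request amounts add up without rounding drift when totalled over many requests.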
Snapshots
- grok-2-vision-1212
- grok-2-vision (alias for grok-2-vision-latest)
- grok-2-vision-latest
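One way to use these identifiers is to pin the dated snapshot where reproducibility matters and use an alias elsewhere. The sketch below is a minimal illustration: the helper names are hypothetical, and the check for a dated snapshot simply assumes that a trailing four-digit suffix (as in -1212) marks a pinned version.

```python
import re

PINNED_SNAPSHOT = "grok-2-vision-1212"
LATEST_ALIAS = "grok-2-vision-latest"

# Assumption: a name ending in "-" plus four digits is a dated snapshot.
_DATED_SUFFIX = re.compile(r"-\d{4}$")


def is_pinned(model_name: str) -> bool:
    """Return True if the name looks like a date-specific snapshot."""
    return bool(_DATED_SUFFIX.search(model_name))


def choose_model(reproducible: bool) -> str:
    """Pick the dated snapshot for stable behavior, the alias otherwise."""
    return PINNED_SNAPSHOT if reproducible else LATEST_ALIAS
```

Which snapshot an alias resolves to at any given time is not stated here, so code that depends on exact behavior is safer with the dated name.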
Positioning and Use Cases
Grok-2-Vision is suited to visual understanding tasks, including image description, visual question answering, document analysis, chart interpretation, and multimodal reasoning. It places no fixed cap on the number of images sent alongside a text prompt, although each image is limited to 10MiB and the 8,192-token context window bounds how much input fits in a single request. Typical applications include content moderation, review of educational materials, and other tasks that require detailed visual analysis.
Rate Limits
- Information not publicly available
Additional Technical Notes
- Image Input Specifications: Maximum 10MiB per image, unlimited number of images, supports JPG/JPEG and PNG formats
- Flexible Input Order: Text and image inputs can be mixed in any order within conversations
- Model Versioning: Date-specific versions (e.g., -1212) provide consistency, while aliases auto-update to latest versions
- Context Limitations: Grok-2-Vision has a smaller context window (8K tokens) than other models (131K tokens)
- Pricing Structure: Grok-2-Vision is billed per token for text input, image input, and output; per-image pricing applies to image generation, which this model does not perform
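The image limits and the flexible input order described in these notes can be sketched as a local pre-flight check plus a message builder. This is an illustration under stated assumptions: the content-part layout (`type`, `text`, `image_url` with a base64 data URL) follows the OpenAI-style chat format and should be verified against the official documentation, and the helper names `detect_image_mime` and `build_user_content` are hypothetical. No network request is made.

```python
import base64

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MiB per image, per the notes above

# Leading file signatures for the two supported formats.
_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def detect_image_mime(data: bytes) -> str:
    """Return the MIME type of a supported image or raise ValueError."""
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 10 MiB limit")
    for signature, mime in _SIGNATURES.items():
        if data.startswith(signature):
            return mime
    raise ValueError("unsupported image format; expected JPG/JPEG or PNG")


def build_user_content(parts):
    """Convert a mixed list of str (text) and bytes (image) into content parts.

    The original order is preserved, so text and images can be interleaved.
    Assumption: the API accepts OpenAI-style content parts.
    """
    content = []
    for part in parts:
        if isinstance(part, str):
            content.append({"type": "text", "text": part})
        elif isinstance(part, (bytes, bytearray)):
            raw = bytes(part)
            mime = detect_image_mime(raw)
            encoded = base64.b64encode(raw).decode("ascii")
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{encoded}"},
            })
        else:
            raise TypeError("parts must be str (text) or bytes (image)")
    return content
```

A call such as `build_user_content(["Compare these:", first_png, "and", second_jpg])` would yield four parts in the same order, which could then be placed in the `content` field of a user message.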
Documentation
Official Documentation