"Which model should we use?"
I get asked this question at least once a day. And the honest answer is: it depends. But that's not helpful, so let me give you the framework I actually use.
The Four Dimensions of Model Selection
Every model decision comes down to four factors:
1. Capability Match
Does the model have the skills your task requires?
2. Cost Economics
What's the total cost of ownership?
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Typical Monthly Cost* |
|---|---|---|---|
| Claude Opus 4 | $15 | $75 | $5,250 |
| Claude Sonnet 4 | $3 | $15 | $1,050 |
| GPT-4o | $5 | $15 | $1,250 |
| Gemini 1.5 Pro | $3.50 | $10.50 | $875 |
| Mistral Large | $4 | $12 | $1,000 |
*Based on 100k requests/month, 1k tokens in, 500 tokens out
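If you want to sanity-check these numbers or plug in your own traffic, the arithmetic is simple. Here's a minimal Python sketch using the footnote's assumptions (prices copied from the table above; swap in your own volumes):

```python
# Back-of-envelope monthly API cost under the footnote's assumptions:
# 100k requests/month, 1k input tokens and 500 output tokens per request.
REQUESTS = 100_000
TOKENS_IN, TOKENS_OUT = 1_000, 500

# (input $/1M tokens, output $/1M tokens), from the table above
PRICES = {
    "Claude Opus 4": (15.00, 75.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4o": (5.00, 15.00),
    "Gemini 1.5 Pro": (3.50, 10.50),
    "Mistral Large": (4.00, 12.00),
}

for model, (price_in, price_out) in PRICES.items():
    monthly = REQUESTS * (TOKENS_IN * price_in + TOKENS_OUT * price_out) / 1_000_000
    print(f"{model}: ${monthly:,.0f}/month")
```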
3. Operational Requirements
Can you actually run this in production?
- **Latency requirements** - real-time vs. batch
- **Uptime SLAs** - 99.9% vs. best effort
- **Data residency** - where can data go?
- **Compliance** - SOC 2, HIPAA, etc.
4. Strategic Fit
Longer-term considerations:
- **Vendor lock-in risk** - how portable is your implementation?
- **Model roadmap** - where is the provider heading?
- **Support quality** - who helps when things break?
The Selection Process
Here's my step-by-step approach:
Step 1: Define Success Criteria
Before looking at any model, write down your must-haves, nice-to-haves, and budget constraints.
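Writing the criteria down as data keeps the later scoring honest. A minimal sketch of what I mean (the criteria, weights, and budget here are placeholders, not recommendations):

```python
# Hypothetical success criteria, written down before testing any model.
criteria = {
    "must_have": [
        "function calling",
        "128k+ context window",
        "SOC 2 compliant hosting",
    ],
    "nice_to_have": {  # name -> weight, reused in the scoring step later
        "p95_latency_under_2s": 0.3,
        "multilingual_support": 0.2,
    },
    "budget_usd_per_month": 3_000,
}
```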
Step 2: Create a Test Suite
Build a representative evaluation set with examples from each category of tasks you need to handle.
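A plain JSONL file with one example per line works well for this. Here's an illustrative shape; the field names and cases are my own convention, not a standard:

```python
import json

# Illustrative evaluation examples; inputs and categories are placeholders.
eval_set = [
    {"category": "summarization", "input": "Summarize this support ticket: ...", "expected": "..."},
    {"category": "extraction", "input": "Pull the order ID from this email: ...", "expected": "..."},
    {"category": "edge_case", "input": "", "expected": "graceful refusal"},
]

with open("eval_set.jsonl", "w") as f:
    for example in eval_set:
        f.write(json.dumps(example) + "\n")
```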
Step 3: Run Comparative Evaluation
This is where ModelMix shines. Run the same inputs through multiple models and compare the outputs; the sketch below shows the underlying loop.
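ModelMix handles this in the UI, but if you'd rather script it, the core loop is provider-agnostic. In this sketch, `complete()` and the model IDs are placeholders, not a real API; wire in whatever SDKs you actually use:

```python
CANDIDATES = ["claude-sonnet-4", "gpt-4o", "gemini-1.5-pro"]  # illustrative IDs

def complete(model: str, prompt: str) -> str:
    # Placeholder: replace with a real call to the provider's SDK.
    return f"[{model}] response to: {prompt}"

def run_comparison(prompts: list[str]) -> list[dict]:
    # Collect every candidate's answer to every prompt for side-by-side review.
    results = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for model in CANDIDATES:
            row[model] = complete(model, prompt)
        results.append(row)
    return results
```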
Step 4: Score and Decide
Use a weighted scoring matrix to make the final decision.
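The matrix itself is just weights times scores, summed per model. A minimal sketch, with made-up numbers purely for illustration:

```python
# Criterion weights (must sum to 1.0) and 0-10 scores from your own evals.
weights = {"quality": 0.4, "latency": 0.2, "cost": 0.25, "compliance": 0.15}

scores = {  # illustrative numbers, not real results
    "claude-sonnet-4": {"quality": 9, "latency": 7, "cost": 8, "compliance": 9},
    "gpt-4o": {"quality": 8, "latency": 8, "cost": 7, "compliance": 8},
}

ranked = sorted(
    ((sum(weights[c] * s[c] for c in weights), model) for model, s in scores.items()),
    reverse=True,
)
for total, model in ranked:
    print(f"{model}: {total:.2f}")
```

The point isn't the arithmetic; it's that the weights force you to argue about priorities before you see the scores.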
Common Mistakes to Avoid
Mistake 1: Benchmark Worship
Public benchmarks are directionally useful but don't reflect your specific use case. Always test on your own data.
Mistake 2: Ignoring Variance
A model that scores 90% on average but with high variance may be worse in production than one that scores 85% consistently, because your users experience the worst cases, not the mean.
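Here's that point in numbers (the scores are invented to illustrate):

```python
import statistics

model_a = [100, 100, 95, 60, 95]  # mean 90, but note the 60
model_b = [86, 84, 85, 86, 84]    # mean 85, never below 84

for name, scores in [("A", model_a), ("B", model_b)]:
    print(f"Model {name}: mean={statistics.mean(scores):.0f}, "
          f"stdev={statistics.stdev(scores):.1f}, worst={min(scores)}")
```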
Mistake 3: Optimizing for Today
The model landscape changes fast. Build for flexibility by abstracting the model layer.
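Abstracting the model layer can be as little as one interface that every provider adapter implements, so a model swap becomes a config change instead of a rewrite. A sketch under that assumption (class and method names are mine, not from any SDK):

```python
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """The single seam between your application and any model provider."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 1024) -> str: ...

class AnthropicClient(LLMClient):
    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        raise NotImplementedError("wrap the Anthropic SDK here")

class OpenAIClient(LLMClient):
    def generate(self, prompt: str, max_tokens: int = 1024) -> str:
        raise NotImplementedError("wrap the OpenAI SDK here")

def get_client(provider: str) -> LLMClient:
    # The model choice lives in config, not scattered through the codebase.
    return {"anthropic": AnthropicClient, "openai": OpenAIClient}[provider]()
```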
Video: Model Selection in Practice
*Watch me evaluate models for a real client use case*
Quick Decision Guide
When you need a fast answer:
| If you need... | Consider first... |
|---|---|
| Best reasoning | Claude Opus 4 |
| Best value | Claude Sonnet 4 |
| Fastest responses | Gemini Flash |
| Best vision | GPT-4o |
| Longest context | Gemini 1.5 Pro |
| Open weights | Mistral Large |
| Lowest cost | Claude Haiku 3.5 |
But remember: always validate for your specific use case.
The ModelMix Workflow
Here's how I use ModelMix for model selection:
1. **Input my evaluation prompts** - one by one or in batch
2. **Select candidate models** - usually 3-4 finalists
3. **Compare outputs side-by-side** - quality, style, accuracy
4. **Export results** - for documentation and stakeholder review
5. **Iterate** - refine prompts, retest edge cases
This process takes hours, not weeks. And the decision is data-driven, not gut-driven.
*What factors do you prioritize in model selection? Let me know on Twitter.*
Charles Kim
Conversational AI Lead at HelloFresh
Charles Kim brings 20+ years of technology experience to the AI space. Currently leading conversational AI initiatives at HelloFresh, he's passionate about vibe coding and generative AI—especially its broad applications across modalities. From enterprise systems to cutting-edge AI tools, Charles explores how technology can transform the way we work and create.