# Model Selection Guide
Detailed guidance on choosing the right model for skills and agents.
## Cost Comparison
| Model | Input (per MTok) | Output (per MTok) | vs Haiku |
|-------|------------------|-------------------|----------|
| **Haiku** | $0.25 | $1.25 | Baseline |
| **Sonnet** | $3.00 | $15.00 | 12x more expensive |
| **Opus** | $15.00 | $75.00 | 60x more expensive |
**Example cost for typical skill call (2K input, 1K output):**
- Haiku: $0.00175
- Sonnet: $0.021 (12x more)
- Opus: $0.105 (60x more)
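The per-call arithmetic above can be checked with a small shell helper (a sketch; token counts and per-MTok prices come straight from the table, the helper name is illustrative):

```bash
#!/bin/sh
# Cost of one call, given token counts and the per-MTok prices from the table.
# Usage: call_cost <input_tokens> <output_tokens> <input_price> <output_price>
call_cost() {
  awk -v it="$1" -v ot="$2" -v ip="$3" -v op="$4" \
    'BEGIN { printf "%.5f\n", (it * ip + ot * op) / 1000000 }'
}

call_cost 2000 1000 0.25 1.25    # Haiku  → 0.00175
call_cost 2000 1000 3.00 15.00   # Sonnet → 0.02100
call_cost 2000 1000 15.00 75.00  # Opus   → 0.10500
```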
## Speed Comparison
| Model | Tokens/Second | vs Haiku |
|-------|---------------|----------|
| **Haiku** | ~100 | Baseline |
| **Sonnet** | ~40 | 2.5x slower |
| **Opus** | ~20 | 5x slower |
## Decision Framework
```
Start with Haiku by default
|
v
Test on 3-5 representative tasks
|
+-- Success rate ≥80%? ---------> ✓ Use Haiku
| (12x cheaper, 2.5-5x faster)
|
+-- Success rate <80%? --------> Try Sonnet
| |
| v
| Test on same tasks
| |
| +-- Success ≥80%? --> Use Sonnet
| |
| +-- Still failing? --> Opus or redesign
|
v
Document why you chose the model
```
## When Haiku Works Well
### ✓ Ideal for Haiku
**Simple sequential workflows:**
- `/dashboard` - Fetch and display
- `/roadmap` - List and format
- `/commit` - Generate message from diff
**Workflows with scripts:**
- Error-prone operations in scripts
- Skills just orchestrate script calls
- Validation is deterministic
**Structured outputs:**
- Tasks with clear templates
- Format is defined upfront
- No ambiguous formatting
**Reference/knowledge skills:**
- `gitea` - CLI reference
- `issue-writing` - Patterns and templates
- `software-architecture` - Best practices
### Examples of Haiku Success
**work-issue skill:**
- Sequential steps (view → branch → plan → implement → PR)
- Each step has clear validation
- Scripts handle error-prone operations
- Success rate: ~90%
**dashboard skill:**
- Fetch data (tea commands)
- Format as table
- Clear, structured output
- Success rate: ~95%
## When to Use Sonnet
### Use Sonnet When
**Haiku fails 20%+ of the time:**
- Test with Haiku first
- If success rate <80%, upgrade to Sonnet
**Complex judgment required:**
- Code review (quality assessment)
- Issue grooming (clarity evaluation)
- Architecture decisions
**Nuanced reasoning:**
- Understanding implicit requirements
- Making trade-off decisions
- Applying context-dependent rules
### Examples of Sonnet Success
**review-pr skill:**
- Requires code understanding
- Judgment about quality/bugs
- Context-dependent feedback
- Originally tried Haiku: 65% success → Sonnet: 85%
**issue-worker agent:**
- Autonomous implementation
- Pattern matching
- Architectural decisions
- Originally tried Haiku: 70% success → Sonnet: 82%
## When to Use Opus
### Reserve Opus For
**Deep architectural reasoning:**
- `software-architect` agent
- Pattern recognition across large codebases
- Identifying subtle anti-patterns
- Trade-off analysis
**High-stakes decisions:**
- Breaking changes analysis
- System-wide refactoring plans
- Security architecture review
**Complex pattern recognition:**
- Requires sophisticated understanding
- Multiple layers of abstraction
- Long-term implications
### Examples of Opus Success
**software-architect agent:**
- Analyzes entire codebase
- Identifies 8 different anti-patterns
- Provides prioritized recommendations
- Sonnet: 68% success → Opus: 88%
**arch-review-repo skill:**
- Comprehensive architecture audit
- Cross-cutting concerns
- System-wide patterns
- Opus justified for depth
## Making Haiku More Effective
If Haiku is struggling, try these improvements **before** upgrading to Sonnet:
### 1. Add Validation Steps
**Instead of:**
```markdown
3. Implement changes and create PR
```
**Try:**
```markdown
3. Implement changes
4. Validate: Run `./scripts/validate.sh` (tests pass, linter clean)
5. Create PR: `./scripts/create-pr.sh`
```
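A minimal sketch of the `./scripts/validate.sh` referenced above (the `run_check` helper and the placeholder checks are illustrative; substitute your project's real test and lint commands):

```bash
#!/bin/sh
# scripts/validate.sh — run every check and exit non-zero on the first
# failure, so the skill's "Validate" step is one deterministic call.
set -eu

run_check() {
  # $1 is a label; the remaining arguments are the command to run.
  label="$1"; shift
  if "$@"; then
    echo "PASS: $label"
  else
    echo "FAIL: $label" >&2
    exit 1
  fi
}

# Placeholders — replace `true` with your actual test and lint commands.
run_check "tests pass"   true
run_check "linter clean" true
```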
### 2. Bundle Error-Prone Operations in Scripts
**Instead of:**
```markdown
5. Create PR: `tea pulls create --title "..." --description "..."`
```
**Try:**
```markdown
5. Create PR: `./scripts/create-pr.sh $issue "$title"`
```
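A matching sketch of a hypothetical `./scripts/create-pr.sh` — it reuses the `tea pulls create` flags shown above and derives the issue link deterministically instead of letting the model type it:

```bash
#!/bin/sh
# scripts/create-pr.sh — bundle branch push and PR creation into one call.
# Usage: ./scripts/create-pr.sh <issue-number> <title>
set -eu

# Build the description from the issue number so "Closes #N" is never mistyped.
pr_description() {
  echo "Closes #$1"
}

issue="${1:-}"
title="${2:-}"

if [ -n "$issue" ] && [ -n "$title" ]; then
  git push -u origin "$(git rev-parse --abbrev-ref HEAD)"
  tea pulls create --title "$title" --description "$(pr_description "$issue")"
fi
```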
### 3. Add Structured Output Templates
**Instead of:**
```markdown
Show the results
```
**Try:**
```markdown
Format results as:
| Issue | Status | Link |
|-------|--------|------|
| ... | ... | ... |
```
### 4. Add Explicit Checklists
**Instead of:**
```markdown
Review the code for quality
```
**Try:**
```markdown
Check:
- [ ] Code quality (readability, naming)
- [ ] Bugs (edge cases, null checks)
- [ ] Tests (coverage, assertions)
```
### 5. Make Instructions More Concise
**Instead of:**
```markdown
Git is a version control system. When you want to commit changes, you use the git commit command which saves your changes to the repository...
```
**Try:**
```markdown
`git commit -m 'feat: add feature'`
```
## Testing Methodology
### Create Test Suite
For each skill, create 3-5 test cases:
**Example: work-issue skill tests**
1. Simple bug fix issue
2. New feature with acceptance criteria
3. Issue missing acceptance criteria
4. Issue with tests that fail
5. Complex refactoring task
### Test with Haiku
```yaml
# Set skill to Haiku
model: haiku
# Run all 5 tests
# Document success/failure for each
```
### Measure Success Rate
```
Success rate = (Successful tests / Total tests) × 100
```
**Decision:**
- ≥80% → Keep Haiku
- <80% → Try Sonnet
- <50% → Likely need Opus or redesign
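The measurement and the decision thresholds above can be sketched together in shell (the output labels are illustrative):

```bash
#!/bin/sh
# Success rate as an integer percent, then the decision rule above.
set -eu

success_rate() {
  echo $(( $1 * 100 / $2 ))   # successful tests / total tests, as a percent
}

pick_model() {
  if [ "$1" -ge 80 ]; then
    echo "haiku"              # keep the cheap default
  elif [ "$1" -ge 50 ]; then
    echo "sonnet"             # retest the same tasks on Sonnet
  else
    echo "opus-or-redesign"   # escalate, or restructure the skill
  fi
}

rate=$(success_rate 4 5)   # → 80
pick_model "$rate"         # → haiku
```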
### Test with Sonnet (if needed)
```yaml
# Upgrade to Sonnet
model: sonnet
# Run same 5 tests
# Compare results
```
### Document Decision
```yaml
---
name: work-issue
model: haiku # Tested: 4/5 tests passed with Haiku (80%)
---
```
Or:
```yaml
---
name: review-pr
model: sonnet # Tested: Haiku 3/5 (60%), Sonnet 4/5 (80%)
---
```
## Common Patterns
### Pattern: Start Haiku, Upgrade if Needed
**Issue-worker agent evolution:**
1. **V1 (Haiku):** 70% success - struggled with pattern matching
2. **Analysis:** Added more examples, still 72%
3. **V2 (Sonnet):** 82% success - better code understanding
4. **Decision:** Keep Sonnet, document why
### Pattern: Haiku for Most, Sonnet for Complex
**Review-pr skill:**
- Static analysis steps: Haiku could handle
- Manual code review: Needs Sonnet judgment
- **Decision:** Use Sonnet for whole skill (simplicity)
### Pattern: Split Complex Skills
**Instead of:** One complex skill using Opus
**Try:** Split into:
- Haiku skill for orchestration
- Sonnet agent for complex subtask
- Saves cost (most work in Haiku)
## Model Selection Checklist
Before choosing a model:
- [ ] Tested with Haiku first
- [ ] Measured success rate on 3-5 test cases
- [ ] Tried improvements (scripts, validation, checklists)
- [ ] Documented why this model is needed
- [ ] Considered cost implications (12x/60x)
- [ ] Considered speed implications (2.5x/5x slower)
- [ ] Will re-test if Claude models improve
## Future-Proofing
**Models improve over time.**
Periodically re-test Sonnet/Opus skills with Haiku:
- Haiku v2 might handle what Haiku v1 couldn't
- Cost savings compound over time
- Speed improvements are valuable
**Set a reminder:** Test Haiku again in 3-6 months.