# Model Selection Guide
Detailed guidance on choosing the right model for skills and agents.
## Cost Comparison
| Model | Input (per MTok) | Output (per MTok) | vs Haiku |
|-------|------------------|-------------------|----------|
| **Haiku** | $0.25 | $1.25 | Baseline |
| **Sonnet** | $3.00 | $15.00 | 12x more expensive |
| **Opus** | $15.00 | $75.00 | 60x more expensive |
**Example cost for typical skill call (2K input, 1K output):**
- Haiku: $0.00175
- Sonnet: $0.021 (12x more)
- Opus: $0.105 (60x more)
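The per-call arithmetic above can be checked with a small shell helper (a sketch; token counts and per-MTok prices come straight from the table, the helper name is illustrative):

```bash
#!/bin/sh
# Cost of one call, given token counts and the per-MTok prices from the table.
# Usage: call_cost <input_tokens> <output_tokens> <input_price> <output_price>
call_cost() {
  awk -v it="$1" -v ot="$2" -v ip="$3" -v op="$4" \
    'BEGIN { printf "%.5f\n", (it * ip + ot * op) / 1000000 }'
}

call_cost 2000 1000 0.25 1.25    # Haiku  → 0.00175
call_cost 2000 1000 3.00 15.00   # Sonnet → 0.02100
call_cost 2000 1000 15.00 75.00  # Opus   → 0.10500
```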
## Speed Comparison
| Model | Tokens/Second | vs Haiku |
|-------|---------------|----------|
| **Haiku** | ~100 | Baseline |
| **Sonnet** | ~40 | 2.5x slower |
| **Opus** | ~20 | 5x slower |
## Decision Framework
```
Start with Haiku by default
|
v
Test on 3-5 representative tasks
|
+-- Success rate ≥80%? ---------> ✓ Use Haiku
| (12x cheaper, 2.5-5x faster)
|
+-- Success rate <80%? --------> Try Sonnet
| |
| v
| Test on same tasks
| |
| +-- Success ≥80%? --> Use Sonnet
| |
| +-- Still failing? --> Opus or redesign
|
v
Document why you chose the model
```
## When Haiku Works Well
### ✓ Ideal for Haiku
**Simple sequential workflows:**
- `/dashboard` - Fetch and display
- `/roadmap` - List and format
- `/commit` - Generate message from diff
**Workflows with scripts:**
- Error-prone operations in scripts
- Skills just orchestrate script calls
- Validation is deterministic
**Structured outputs:**
- Tasks with clear templates
- Format is defined upfront
- No ambiguous formatting
**Reference/knowledge skills:**
- `gitea` - CLI reference
- `issue-writing` - Patterns and templates
- `software-architecture` - Best practices
### Examples of Haiku Success
**work-issue skill:**
- Sequential steps (view → branch → plan → implement → PR)
- Each step has clear validation
- Scripts handle error-prone operations
- Success rate: ~90%
**dashboard skill:**
- Fetch data (tea commands)
- Format as table
- Clear, structured output
- Success rate: ~95%
## When to Use Sonnet
### Use Sonnet When
**Haiku fails 20%+ of the time:**
- Test with Haiku first
- If success rate <80%, upgrade to Sonnet
**Complex judgment required:**
- Code review (quality assessment)
- Issue grooming (clarity evaluation)
- Architecture decisions
**Nuanced reasoning:**
- Understanding implicit requirements
- Making trade-off decisions
- Applying context-dependent rules
### Examples of Sonnet Success
**review-pr skill:**
- Requires code understanding
- Judgment about quality/bugs
- Context-dependent feedback
- Originally tried Haiku: 65% success → Sonnet: 85%
**issue-worker agent:**
- Autonomous implementation
- Pattern matching
- Architectural decisions
- Originally tried Haiku: 70% success → Sonnet: 82%
## When to Use Opus
### Reserve Opus For
**Deep architectural reasoning:**
- `software-architect` agent
- Pattern recognition across large codebases
- Identifying subtle anti-patterns
- Trade-off analysis
**High-stakes decisions:**
- Breaking changes analysis
- System-wide refactoring plans
- Security architecture review
**Complex pattern recognition:**
- Requires sophisticated understanding
- Multiple layers of abstraction
- Long-term implications
### Examples of Opus Success
**software-architect agent:**
- Analyzes entire codebase
- Identifies 8 different anti-patterns
- Provides prioritized recommendations
- Sonnet: 68% success → Opus: 88%
**arch-review-repo skill:**
- Comprehensive architecture audit
- Cross-cutting concerns
- System-wide patterns
- Opus justified for depth
## Making Haiku More Effective
If Haiku is struggling, try these improvements **before** upgrading to Sonnet:
### 1. Add Validation Steps
**Instead of:**
```markdown
3. Implement changes and create PR
```
**Try:**
```markdown
3. Implement changes
4. Validate: Run `./scripts/validate.sh` (tests pass, linter clean)
5. Create PR: `./scripts/create-pr.sh`
```
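A minimal sketch of the `./scripts/validate.sh` referenced above (the `run_check` helper and the placeholder checks are illustrative; substitute your project's real test and lint commands):

```bash
#!/bin/sh
# scripts/validate.sh — run every check and exit non-zero on the first
# failure, so the skill's "Validate" step is one deterministic call.
set -eu

run_check() {
  # $1 is a label; the remaining arguments are the command to run.
  label="$1"; shift
  if "$@"; then
    echo "PASS: $label"
  else
    echo "FAIL: $label" >&2
    exit 1
  fi
}

# Placeholders — replace `true` with your actual test and lint commands.
run_check "tests pass"   true
run_check "linter clean" true
```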
### 2. Bundle Error-Prone Operations in Scripts
**Instead of:**
```markdown
5. Create PR: `tea pulls create --title "..." --description "..."`
```
**Try:**
```markdown
5. Create PR: `./scripts/create-pr.sh $issue "$title"`
```
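A matching sketch of a hypothetical `./scripts/create-pr.sh` — it reuses the `tea pulls create` flags shown above and derives the issue link deterministically instead of letting the model type it:

```bash
#!/bin/sh
# scripts/create-pr.sh — bundle branch push and PR creation into one call.
# Usage: ./scripts/create-pr.sh <issue-number> <title>
set -eu

# Build the description from the issue number so "Closes #N" is never mistyped.
pr_description() {
  echo "Closes #$1"
}

issue="${1:-}"
title="${2:-}"

if [ -n "$issue" ] && [ -n "$title" ]; then
  git push -u origin "$(git rev-parse --abbrev-ref HEAD)"
  tea pulls create --title "$title" --description "$(pr_description "$issue")"
fi
```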
### 3. Add Structured Output Templates
**Instead of:**
```markdown
Show the results
```
**Try:**
```markdown
Format results as:
| Issue | Status | Link |
|-------|--------|------|
| ... | ... | ... |
```
### 4. Add Explicit Checklists
**Instead of:**
```markdown
Review the code for quality
```
**Try:**
```markdown
Check:
- [ ] Code quality (readability, naming)
- [ ] Bugs (edge cases, null checks)
- [ ] Tests (coverage, assertions)
```
### 5. Make Instructions More Concise
**Instead of:**
```markdown
Git is a version control system. When you want to commit changes, you use the git commit command which saves your changes to the repository...
```
**Try:**
```markdown
`git commit -m 'feat: add feature'`
```
## Testing Methodology
### Create Test Suite
For each skill, create 3-5 test cases:
**Example: work-issue skill tests**
1. Simple bug fix issue
2. New feature with acceptance criteria
3. Issue missing acceptance criteria
4. Issue with tests that fail
5. Complex refactoring task
### Test with Haiku
```yaml
# Set skill to Haiku
model: haiku
# Run all 5 tests
# Document success/failure for each
```
### Measure Success Rate
```
Success rate = (Successful tests / Total tests) × 100
```
**Decision:**
- ≥80% → Keep Haiku
- <80% → Try Sonnet
- <50% → Likely need Opus or redesign
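The measurement and the decision thresholds above can be sketched together in shell (the output labels are illustrative):

```bash
#!/bin/sh
# Success rate as an integer percent, then the decision rule above.
set -eu

success_rate() {
  echo $(( $1 * 100 / $2 ))   # successful tests / total tests, as a percent
}

pick_model() {
  if [ "$1" -ge 80 ]; then
    echo "haiku"              # keep the cheap default
  elif [ "$1" -ge 50 ]; then
    echo "sonnet"             # retest the same tasks on Sonnet
  else
    echo "opus-or-redesign"   # escalate, or restructure the skill
  fi
}

rate=$(success_rate 4 5)   # → 80
pick_model "$rate"         # → haiku
```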
### Test with Sonnet (if needed)
```yaml
# Upgrade to Sonnet
model: sonnet
# Run same 5 tests
# Compare results
```
### Document Decision
```yaml
---
name: work-issue
model: haiku # Tested: 4/5 tests passed with Haiku (80%)
---
```
Or:
```yaml
---
name: review-pr
model: sonnet # Tested: Haiku 3/5 (60%), Sonnet 4/5 (80%)
---
```
## Common Patterns
### Pattern: Start Haiku, Upgrade if Needed
**Issue-worker agent evolution:**
1. **V1 (Haiku):** 70% success - struggled with pattern matching
2. **Analysis:** Added more examples, still 72%
3. **V2 (Sonnet):** 82% success - better code understanding
4. **Decision:** Keep Sonnet, document why
### Pattern: Haiku for Most, Sonnet for Complex
**Review-pr skill:**
- Static analysis steps: Haiku could handle
- Manual code review: Needs Sonnet judgment
- **Decision:** Use Sonnet for whole skill (simplicity)
### Pattern: Split Complex Skills
**Instead of:** One complex skill using Opus
**Try:** Split into:
- Haiku skill for orchestration
- Sonnet agent for complex subtask
- Saves cost (most work in Haiku)
## Model Selection Checklist
Before choosing a model:
- [ ] Tested with Haiku first
- [ ] Measured success rate on 3-5 test cases
- [ ] Tried improvements (scripts, validation, checklists)
- [ ] Documented why this model is needed
- [ ] Considered cost implications (12x/60x)
- [ ] Considered speed implications (2.5x/5x slower)
- [ ] Will re-test if Claude models improve
## Future-Proofing
**Models improve over time.**
Periodically re-test Sonnet/Opus skills with Haiku:
- Haiku v2 might handle what Haiku v1 couldn't
- Cost savings compound over time
- Speed improvements are valuable
**Set a reminder:** Test Haiku again in 3-6 months.