# Prompt Curation Process

> How Citee Index builds and validates the prompt pool for each category: a 6-stage process designed to prevent the "garbage in, garbage out" failure mode.

---

## Why this matters

If the prompt pool is junk (for example "dyfuzory do włosów ranking", roughly "hair diffusers ranking", or "wąski do samochodu", literally "narrow for a car"), the ranking is junk. Prompt quality is the single most important upstream input to ranking integrity.

This process exists to ensure every prompt in the active pool passes two tests:

1. **Real buyer test** — would an actual buyer of this category type this query into ChatGPT/Perplexity?
2. **Reality check** — does this query appear in actual search/discussion data (Google Trends, Reddit, Quora)?

Prompts failing either test are excluded.

## The 6 stages

```
Stage 1: Persona Generator       (AI)
   ↓ 5–10 buyer personas per category
Stage 2: Prompt Brainstormer     (AI per persona)
   ↓ 200–300 raw prompts
Stage 3: Reality Check            (Google Trends / Reddit / Quora / AnswerThePublic)
   ↓ ~150 prompts with verified search demand
Stage 4: Multi-agent Validation  (3 critic agents in parallel)
   ↓ ~120 prompts after critique
Stage 5: Pilot Test Run           (10-prompt sample × 3 models)
   ↓ ~110 prompts that produce stable, sensible AI outputs
Stage 6: Human Approval           (founder + category expert)
   ↓ FINAL POOL: 100 prompts
```

### Stage 1 — Persona Generator

Claude generates 5–10 buyer personas per category. Each persona has:
- Demographics (age, location, income bracket)
- Pain points (what they're trying to solve)
- Decision factors (price, ingredients, brand, reviews, certifications)
- Vocabulary (how they actually talk — formal vs colloquial, technical vs lay)

Example for Świece sojowe PL (soy candles, Poland):
- "Woman, 30+, buying a gift for her mother"
- "Self-care millennial, 25–35, after work"
- "Interior designer, minimalist apartment"
- "Man buying a Valentine's Day gift"
- "Mother of small children looking for a safe scent"

### Stage 2 — Prompt Brainstormer

For each persona, Claude generates 30–50 prompts in the voice of that persona — "how would I phrase this question to ChatGPT?" Total per category: ~200–300 raw prompts.

Distribution target by type (enforced at this stage):
- Buying intent (weight 2.0): 30%
- Comparison (weight 1.5): 25%
- Specific need (weight 1.5): 20%
- Informational (weight 0.3): 15%
- Brand-direct (weight 0.3): 10%
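
The type quotas above can be turned into per-type counts for a raw pool of any size. The following is a minimal sketch, not the production implementation: `TARGET_PERCENT` and `target_counts` are hypothetical names, and the largest-remainder rounding is an assumption chosen so that the counts always sum to the pool size.

```python
# Target share per prompt type in whole percent, copied from the list above.
# The dictionary keys are illustrative identifiers, not names from the pipeline.
TARGET_PERCENT = {
    "buying_intent": 30,
    "comparison": 25,
    "specific_need": 20,
    "informational": 15,
    "brand_direct": 10,
}


def target_counts(pool_size: int) -> dict[str, int]:
    """Split pool_size across prompt types with largest-remainder rounding."""
    # Integer arithmetic avoids floating-point drift: pct * pool_size is
    # expressed in hundredths of a prompt.
    scaled = {t: pct * pool_size for t, pct in TARGET_PERCENT.items()}
    counts = {t: value // 100 for t, value in scaled.items()}
    leftover = pool_size - sum(counts.values())
    # Hand the remaining prompts to the types with the largest fractional part.
    by_remainder = sorted(scaled, key=lambda t: scaled[t] % 100, reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1
    return counts


if __name__ == "__main__":
    print(target_counts(100))
    print(target_counts(250))
```

For a 100-prompt pool this yields 30/25/20/15/10. For pool sizes that do not divide evenly, the rounding rule decides which type receives the extra prompt.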

### Stage 3 — Reality Check

Each prompt is cross-referenced against real-world data:

| Source | Method | Threshold or signal |
|---|---|---|
| **Google Trends API** | PL queries past 12 months | minimum search volume present |
| **Google Search Console** (where available) | Real search queries to brand sites we have access to | inspirational source for vocabulary |
| **Reddit search** | r/Polska_Marka, niche subreddits | actual user phrasing |
| **Quora PL** | Questions asked in category | real curiosity patterns |
| **AnswerThePublic** | Public scraping of "people also ask" | discovery of long-tail patterns |
| **People Also Ask (Google)** | For top category queries | semantic neighbors |

Prompts with zero or marginal real-world signal are removed. ~200–300 → ~150.
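
As a rough illustration of this filter, the sketch below keeps a prompt only if enough sources show evidence for it. Everything here is an assumption: the process does not define "marginal", so `min_sources` is a hypothetical knob, the source keys are illustrative, and the sample prompts are placeholders rather than production data.

```python
from dataclasses import dataclass, field


@dataclass
class PromptEvidence:
    """Evidence collected for one candidate prompt (hypothetical structure)."""

    text: str
    # Source name -> whether that source showed any matching query or question.
    hits: dict[str, bool] = field(default_factory=dict)

    def source_count(self) -> int:
        """Number of sources that showed any signal for this prompt."""
        return sum(1 for seen in self.hits.values() if seen)


def reality_check(
    candidates: list[PromptEvidence], min_sources: int = 1
) -> list[PromptEvidence]:
    """Keep prompts confirmed by at least min_sources sources (assumed rule)."""
    return [c for c in candidates if c.source_count() >= min_sources]


if __name__ == "__main__":
    sample = [
        PromptEvidence("placeholder prompt A", {"google_trends": True, "reddit": True}),
        PromptEvidence("placeholder prompt B", {"google_trends": False, "reddit": False}),
    ]
    for kept in reality_check(sample):
        print(kept.text)
```

Raising `min_sources` makes the filter stricter; where the real cut-off sits is a judgment call that this sketch does not settle.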

### Stage 4 — Multi-agent Validation

Three AI critic agents review the list in parallel:

**Agent A — "Real buyer critique"**
Persona-grounded review. Each persona "reads" the prompts and marks the ones that don't sound natural for that persona. Prompts marked unnatural by 2+ personas go on Agent A's flag list.

**Agent B — "Methodology critic"**
Statistical and structural review. Checks:
- Prompt type distribution stays within ±5% of target
- No subcategory over/under-represented
- Vocabulary diversity (we're not repeating the same phrasing)
- Length distribution reasonable (no 50-word prompts, no 2-word prompts)
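
The first check in Agent B's list is easy to express numerically. A minimal sketch follows, assuming that "±5%" means five percentage points of the pool and that the targets are the Stage 2 quotas; the function names and dictionary keys are hypothetical.

```python
# Stage 2 target shares in percent (keys are illustrative identifiers).
TARGET_PERCENT = {
    "buying_intent": 30,
    "comparison": 25,
    "specific_need": 20,
    "informational": 15,
    "brand_direct": 10,
}


def share_drift(type_counts: dict[str, int]) -> dict[str, float]:
    """Actual share minus target share, in percentage points, per type."""
    total = sum(type_counts.values())
    if total == 0:
        raise ValueError("cannot compute shares for an empty pool")
    return {
        t: 100 * type_counts.get(t, 0) / total - target
        for t, target in TARGET_PERCENT.items()
    }


def types_out_of_tolerance(
    type_counts: dict[str, int], tolerance_pp: float = 5.0
) -> list[str]:
    """Prompt types whose share deviates from target by more than tolerance_pp."""
    drift = share_drift(type_counts)
    return [t for t, delta in drift.items() if abs(delta) > tolerance_pp]
```

A pool that is exactly on target returns an empty list; any type listed in the result would be a candidate for trimming or topping up before the next stage.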

**Agent C — "Vendor exploit hunter"**
Anti-gaming review. Identifies prompts that are too easy to game with content-marketing fluff:
- Generic informational queries that any vendor can write a blog post for
- Prompts where the AI answer is dominated by Wikipedia (a vendor can edit Wikipedia)
- Prompts where the answer comes from a single Reddit post (a vendor can write that post)

Each agent produces a list of flagged prompts. Anything flagged by 2+ agents is removed. ~150 → ~120.
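
The removal rule is a simple vote count across the three critics. A minimal sketch, assuming each agent returns a list of flagged prompt strings; `apply_critic_votes` and the agent keys are hypothetical names.

```python
from collections import Counter


def apply_critic_votes(
    prompts: list[str],
    flags_by_agent: dict[str, list[str]],
    min_flags: int = 2,
) -> list[str]:
    """Drop every prompt flagged by at least min_flags distinct agents."""
    votes: Counter[str] = Counter()
    for flagged in flags_by_agent.values():
        # set() ensures one agent contributes at most one vote per prompt.
        votes.update(set(flagged))
    return [p for p in prompts if votes[p] < min_flags]


if __name__ == "__main__":
    pool = ["prompt 1", "prompt 2", "prompt 3"]
    flags = {
        "real_buyer": ["prompt 1", "prompt 2"],
        "methodology": ["prompt 2"],
        "exploit_hunter": ["prompt 3", "prompt 3"],
    }
    print(apply_critic_votes(pool, flags))
```

In this example only "prompt 2" is removed: it is the only prompt flagged by two different agents, and the duplicate flag on "prompt 3" still counts as a single vote.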

### Stage 5 — Pilot Test Run

The ~120 candidate prompts get a sample test:
- Pick 10 prompts (stratified across types)
- Run on ChatGPT-search, Perplexity Sonar, Gemini Pro
- Each prompt × 3 models = 30 outputs

**Reject criteria:**
- AI returns "I don't know" or "this depends on your preferences" (no actionable brand mentions)
- Outputs across 3 models have zero overlap (prompt produces incoherent/random results)
- AI returns a list of countries/categories instead of brands (prompt was misinterpreted)

Prompts failing pilot are flagged for revision or removal. ~120 → ~110.
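
The first two reject criteria can be approximated mechanically once brand mentions have been extracted from each answer. The sketch below assumes that extraction has already happened and that "zero overlap" means no brand is named by more than one model. Both are interpretations, and the function names, model keys, and brand labels are hypothetical.

```python
from itertools import combinations


def has_zero_overlap(brands_by_model: dict[str, set[str]]) -> bool:
    """True if no brand is mentioned by more than one model."""
    return all(
        a.isdisjoint(b) for a, b in combinations(brands_by_model.values(), 2)
    )


def pilot_verdict(brands_by_model: dict[str, set[str]]) -> str:
    """Classify one prompt from the brand sets each model returned."""
    if all(not brands for brands in brands_by_model.values()):
        return "reject: no actionable brand mentions"
    if has_zero_overlap(brands_by_model):
        return "reject: zero overlap across models"
    return "pass"


if __name__ == "__main__":
    stable = {
        "chatgpt_search": {"Brand A", "Brand B"},
        "perplexity_sonar": {"Brand B", "Brand C"},
        "gemini_pro": {"Brand A"},
    }
    print(pilot_verdict(stable))
```

The third criterion (a list of countries or categories instead of brands) needs a judgment about what the extracted entities are, so it is left out of this sketch.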

### Stage 6 — Human Approval

The founder + category expert review the final ~110 candidates and select the production 100.

**The founder always reviews.** For categories outside the founder's domain knowledge, a paid expert reviewer (1–2 hours, $50–100) is engaged:

| Category | Expert profile |
|---|---|
| Kosmetyki naturalne (natural cosmetics) | Beauty product manager / freelance marketer |
| Suplementy / nutricosmetyki (supplements / nutricosmetics) | Nutritionist / DTC supplement marketer |
| Diety pudełkowe (boxed diet catering) | Fitness coach / dietitian |
| Premium pet food | Pet specialty store owner / dog trainer |
| Kawa specialty (specialty coffee) | Coffee blogger / barista trainer |
| Czekolada rzemieślnicza (craft chocolate) | Food blogger / chocolate-focused content creator |
| Kursy programowania (programming courses) | Bootcamp graduate / hiring manager |
| Kliniki estetyczne (aesthetic clinics) | Dermatologist or aesthetic medicine consultant |
| Fitness studios | Personal trainer / gym manager |
| Kosmetyki męskie (men's cosmetics) | Men's grooming influencer / DTC marketer |
| Świece sojowe (soy candles) | Founder + JAKULO customer service data |

The final 100 prompts are stored in the closed `prompts/{slug}/` directory (gitignored in this repo). A public example framework is committed to `prompts/example-{slug}.md` (this repo) showing the structure and 5–10 illustrative examples per type, but **not the exact production strings**.

## Quarterly refresh — 20% rotation

Every quarter, the curation pipeline runs in refresh mode:

1. **Trend check** — Google Trends API: which prompts have lost relative search volume?
2. **New patterns** — Reddit/Quora scrape: what new question patterns have emerged?
3. **New entrants** — scan model outputs from past quarter: what brands appeared in answers but aren't in our brand catalog?
4. **Generate replacements** — Stages 1–5 for the rotation set
5. **Human approval** — founder reviews the proposed 20 swaps in 5–10 minutes

This guards against Goodhart's Law: as the prompt pool becomes known to vendors (through reverse-engineering or leaks), rotating 20% of it each quarter keeps vendors from permanently optimizing against our exact queries.
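
One way to pick the rotation set is to rank prompts by how much relative search volume they lost and retire the bottom 20%. This is a sketch under assumptions: the process does not say how the swaps are chosen, and `trend_change` (relative change in search interest, with negative values meaning decline) is a hypothetical input.

```python
def pick_rotation_set(
    trend_change: dict[str, float], rotation_percent: int = 20
) -> list[str]:
    """Return the prompts with the steepest decline, sized to rotation_percent."""
    # Integer arithmetic keeps the quota exact (100 prompts -> 20 swaps).
    quota = len(trend_change) * rotation_percent // 100
    # Sort ascending so the largest losses come first; ties break by prompt text.
    ranked = sorted(trend_change, key=lambda p: (trend_change[p], p))
    return ranked[:quota]


if __name__ == "__main__":
    # Placeholder pool: 100 prompts with synthetic trend changes.
    pool = {f"prompt {i:03d}": (i - 50) / 100 for i in range(100)}
    print(pick_rotation_set(pool))
```

A real refresh would presumably also weigh the new question patterns and new entrants from the steps above, which a single trend score does not capture.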

## Cost per category

| Stage | API cost | Human cost |
|---|---|---|
| 1 — Persona Generator | ~$0.50 (Claude) | — |
| 2 — Prompt Brainstormer | ~$1.50 (Claude) | — |
| 3 — Reality Check | $0 (free APIs) | — |
| 4 — Multi-agent Validation | ~$3 (Claude × 3 critics) | — |
| 5 — Pilot Test Run | ~$5 (10 prompts × 3 models = 30 outputs) | — |
| 6 — Human Approval | — | ~30 min founder + 1–2h expert ($50–100 for non-founder categories) |
| **Total per category** | **~$10** | **~30 min + $50–100 for expert categories** |

For 11 pilot categories: ~$110 API + ~5.5 hours founder time + ~$500–1,000 expert reviewers (10 non-founder categories × $50–100).
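
The totals can be re-derived from the per-stage table. The short script below is only arithmetic on the figures quoted above. The one assumption is that ten of the eleven pilot categories need a paid expert (every category except Świece sojowe, which the founder covers); `pilot_budget` is a hypothetical helper name.

```python
# Per-stage API cost in USD, from the cost table above.
API_COST_USD = {
    "persona_generator": 0.50,
    "prompt_brainstormer": 1.50,
    "reality_check": 0.00,
    "multi_agent_validation": 3.00,
    "pilot_test_run": 5.00,
}
FOUNDER_MINUTES_PER_CATEGORY = 30
EXPERT_FEE_USD = (50, 100)  # low and high end of the quoted range


def pilot_budget(categories: int, expert_categories: int) -> dict[str, float]:
    """Total API cost, founder time, and expert fee range for a pilot."""
    per_category_api = sum(API_COST_USD.values())
    return {
        "api_usd": per_category_api * categories,
        "founder_hours": categories * FOUNDER_MINUTES_PER_CATEGORY / 60,
        "expert_usd_low": expert_categories * EXPERT_FEE_USD[0],
        "expert_usd_high": expert_categories * EXPERT_FEE_USD[1],
    }


if __name__ == "__main__":
    print(pilot_budget(categories=11, expert_categories=10))
```

With 11 categories and 10 expert reviews this gives $110 in API cost, 5.5 founder hours, and $500 to $1,000 in expert fees.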

## Quarterly refresh cost

Per category per quarter: ~$3 API + 5–10 minutes founder review.

For 11 categories: ~$35 API + 1–2 hours founder time per quarter.

## Why this is published openly

We publish the **process** because the integrity of the ranking depends on the integrity of the prompts, and external review of the process is the strongest defense against the "your prompts are garbage" attack.

We do NOT publish the **exact strings** because of Goodhart's Law: once prompts are known, vendors optimize against them, and the prompts stop measuring organic AI search behavior.

The boundary between "open process" and "closed strings" is itself documented openly.