citee-methodology/prompts/README.md
Jacek Kubas f76cf2858b v1.0.0 — initial Citee Index Methodology release
Foundational public methodology for the first open public ranking of brand
visibility in AI search results (ChatGPT, Perplexity, Gemini, Claude).

This release establishes the framework — no rankings have been computed
or published yet. First scan cycle: late May 2026 (private validation).
First public ranking publication target: August 2026, after 3 validation
cycles.

Includes:
- methodology.json: machine-readable formulas, weights, policies
- README.md: human-readable overview + open/closed boundary
- CHANGELOG.md: versioning policy + v1.0.0 release notes
- taxonomy.md: tier system + 11 PL pilot categories
- LICENSE: MIT
- .gitignore: closed operational data (exact prompts, anti-gaming thresholds)
- prompts/README.md: 6-stage prompt curation process
- prompts/example-swiece-sojowe-pl.md: illustrative framework for first category

Strategic principles:
- Algorithm-first, no advisory board
- Open methodology + closed exact prompts (Goodhart's Law defense)
- No retroactive changes (FIDE 2024 lesson)
- No pay-to-play, hard rule (Moody's / Forbes 30 Under 30 lessons)
- Subjective opinion disclaimer (Gartner v. NetScout 2020 First Amendment shield)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:25:56 +02:00


Prompt Curation Process

How Citee Index builds and validates the prompt pool for each category: a 6-stage process designed to prevent the "garbage in, garbage out" failure mode.


Why this matters

If the prompt pool is junk (for example "dyfuzory do włosów ranking", literally "hair diffusers ranking", or "wąski do samochodu", literally "narrow for a car"), the ranking is junk. Prompt quality is the single most important upstream input to ranking integrity.

This process exists to ensure every prompt in the active pool meets two tests:

  1. Real buyer test — would an actual buyer of this category type this query into ChatGPT/Perplexity?
  2. Reality check — does this query appear in actual search/discussion data (Google Trends, Reddit, Quora)?

Prompts failing either test are excluded.

The 6 stages

Stage 1: Persona Generator       (AI)
   ↓ 5–10 buyer personas per category
Stage 2: Prompt Brainstormer     (AI per persona)
   ↓ 200–300 raw prompts
Stage 3: Reality Check            (Google Trends / Reddit / Quora / AnswerThePublic)
   ↓ ~150 prompts with verified search demand
Stage 4: Multi-agent Validation  (3 critic agents in parallel)
   ↓ ~120 prompts after critique
Stage 5: Pilot Test Run           (10-prompt sample × 3 models)
   ↓ ~110 prompts that produce stable, sensible AI outputs
Stage 6: Human Approval           (founder + category expert)
   ↓ FINAL POOL: 100 prompts

Stage 1 — Persona Generator

Claude generates 5–10 buyer personas per category. Each persona has:

  • Demographics (age, location, income bracket)
  • Pain points (what they're trying to solve)
  • Decision factors (price, ingredients, brand, reviews, certifications)
  • Vocabulary (how they actually talk — formal vs colloquial, technical vs lay)

Example for Świece sojowe PL (soy candles, Poland):

  • "Woman aged 30+ buying a gift for her mother"
  • "Self-care millennial, 25–35, winding down after work"
  • "Interior-design enthusiast with a minimalist apartment"
  • "Man buying a Valentine's Day gift"
  • "Mother of small children looking for a safe scent"
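The persona attributes listed above could be captured in a small record type. The following is a minimal sketch, not the production schema: the class name `Persona`, its field names, and the helper `validate_persona_set` are all hypothetical, chosen only to mirror the four attribute groups and the 5–10 personas per category described in this stage.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """One buyer persona. Field names are illustrative assumptions."""
    label: str                   # short description of the persona
    demographics: dict[str, str] # age, location, income bracket
    pain_points: list[str]       # what they are trying to solve
    decision_factors: list[str]  # price, ingredients, brand, reviews, certifications
    vocabulary: str              # e.g. "colloquial, lay" or "formal, technical"


def validate_persona_set(personas: list[Persona]) -> None:
    """Raise ValueError unless the set has 5-10 personas with unique labels."""
    if not 5 <= len(personas) <= 10:
        raise ValueError(f"expected 5-10 personas, got {len(personas)}")
    labels = [p.label for p in personas]
    if len(set(labels)) != len(labels):
        raise ValueError("persona labels must be unique")
```

Keeping vocabulary as an explicit field makes it available to the next stage, where prompts are written in each persona's voice.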

Stage 2 — Prompt Brainstormer

For each persona, Claude generates 30–50 prompts in the voice of that persona ("how would I phrase this question to ChatGPT?"). Total per category: ~200–300 raw prompts.

Distribution target by type (enforced at this stage):

  • Buying intent (weight 2.0): 30%
  • Comparison (weight 1.5): 25%
  • Specific need (weight 1.5): 20%
  • Informational (weight 0.3): 15%
  • Brand-direct (weight 0.3): 10%
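To make the distribution target concrete, the percentages can be turned into per-type prompt counts for a pool of any size. This is a sketch under stated assumptions: the type keys and the function name `quotas` are hypothetical, and the largest-remainder rounding is simply one way to make the counts sum exactly to the pool size. The weights are copied from the list for reference only; how they enter the score is defined elsewhere in the methodology.

```python
# Target share (in percent) and weight per prompt type, copied from the list above.
TYPE_TARGETS = {
    "buying_intent": (30, 2.0),
    "comparison": (25, 1.5),
    "specific_need": (20, 1.5),
    "informational": (15, 0.3),
    "brand_direct": (10, 0.3),
}


def quotas(total: int) -> dict[str, int]:
    """Split `total` prompts across types using largest-remainder rounding."""
    # Integer arithmetic on percentages avoids floating-point rounding surprises.
    counts = {t: pct * total // 100 for t, (pct, _) in TYPE_TARGETS.items()}
    leftover = total - sum(counts.values())
    # Hand the leftover prompts to the types with the largest fractional part.
    by_remainder = sorted(
        TYPE_TARGETS,
        key=lambda t: (TYPE_TARGETS[t][0] * total) % 100,
        reverse=True,
    )
    for t in by_remainder[:leftover]:
        counts[t] += 1
    return counts
```

For a pool of 100, `quotas(100)` reproduces the percentages directly (30, 25, 20, 15, 10). For other sizes the counts still sum to the requested total.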

Stage 3 — Reality Check

Each prompt is cross-referenced against real-world data:

| Source | Method | Threshold |
|---|---|---|
| Google Trends API | PL queries past 12 months | minimum search volume present |
| Google Search Console (where available) | Real search queries to brand sites we have access to | inspirational source for vocabulary |
| Reddit search | r/Polska_Marka, niche subreddits | actual user phrasing |
| Quora PL | Questions asked in category | real curiosity patterns |
| AnswerThePublic | Public scraping of "people also ask" | discovery of long-tail patterns |
| People Also Ask (Google) | For top category queries | semantic neighbors |

Prompts with zero/marginal real-world signal are removed. ~300 → ~150.

Stage 4 — Multi-agent Validation

Three AI critic agents review the list in parallel:

Agent A ("Real buyer critique"): persona-grounded review. Each persona "reads" the prompts and flags ones that don't sound natural for that persona. Prompts marked unnatural by 2+ personas are removed.

Agent B ("Methodology critic"): statistical and structural review. Checks:

  • Prompt type distribution stays within ±5% of target
  • No subcategory over/under-represented
  • Vocabulary diversity (we're not repeating the same phrasing)
  • Length distribution reasonable (no 50-word prompts, no 2-word prompts)
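Two of Agent B's checks are mechanical enough to sketch. The code below makes two assumptions that the text does not settle: "±5%" is read as ±5 percentage points of each type's share, and the length bounds (3 to 49 words) are hypothetical values derived from "no 50-word prompts, no 2-word prompts". The names `distribution_drift` and `has_bad_length` are illustrative.

```python
from collections import Counter

# Target share per prompt type, in percent (from the Stage 2 distribution target).
TARGET_PCT = {
    "buying_intent": 30,
    "comparison": 25,
    "specific_need": 20,
    "informational": 15,
    "brand_direct": 10,
}


def distribution_drift(types: list[str], tolerance_pts: float = 5.0) -> dict[str, float]:
    """Return {type: deviation in percentage points} for types outside tolerance."""
    if not types:
        return {}
    counts = Counter(types)
    drift = {}
    for prompt_type, target in TARGET_PCT.items():
        actual = 100 * counts[prompt_type] / len(types)
        if abs(actual - target) > tolerance_pts:
            drift[prompt_type] = round(actual - target, 1)
    return drift


def has_bad_length(prompt: str, min_words: int = 3, max_words: int = 49) -> bool:
    """True if the prompt is shorter or longer than the assumed word bounds."""
    return not min_words <= len(prompt.split()) <= max_words
```

An empty result from `distribution_drift` means every type is within tolerance. A non-empty result names the over- or under-represented types and by how much.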

Agent C ("Vendor exploit hunter"): anti-gaming review. Identifies prompts that are too easy to game with content-marketing fluff:

  • Generic informational queries that any vendor can write a blog post for
  • Prompts where the AI answer is dominated by Wikipedia (a vendor can edit Wikipedia)
  • Prompts where the answer comes from a single Reddit post (a vendor can write that post)

Each agent produces a list of flagged prompts. Anything flagged by 2+ agents is removed. ~150 → ~120.
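The "flagged by 2+ agents" rule is a simple vote count. A minimal sketch, assuming each agent returns a plain list of flagged prompt strings; the function name `apply_agent_flags` and the agent keys are hypothetical:

```python
from collections import Counter


def apply_agent_flags(
    prompts: list[str],
    flags_by_agent: dict[str, list[str]],
    min_flags: int = 2,
) -> list[str]:
    """Keep prompts flagged by fewer than `min_flags` distinct agents."""
    votes: Counter[str] = Counter()
    for flagged in flags_by_agent.values():
        # set() ensures an agent counts once per prompt even if it repeats a flag.
        votes.update(set(flagged))
    return [p for p in prompts if votes[p] < min_flags]


# Usage with placeholder prompt IDs: only "p2" is flagged by two or more agents.
kept = apply_agent_flags(
    ["p1", "p2", "p3"],
    {"real_buyer": ["p1", "p2"], "methodology": ["p2"], "exploit_hunter": ["p3", "p3"]},
)
```

Requiring agreement between at least two critics keeps a single over-eager agent from shrinking the pool on its own.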

Stage 5 — Pilot Test Run

The ~120 candidate prompts get a sample test:

  • Pick 10 prompts (stratified across types)
  • Run on ChatGPT-search, Perplexity Sonar, Gemini Pro
  • Each prompt × 3 models = 30 outputs

Reject criteria:

  • AI returns "I don't know" or "this depends on your preferences" (no actionable brand mentions)
  • Outputs across 3 models have zero overlap (prompt produces incoherent/random results)
  • AI returns a list of countries/categories instead of brands (prompt was misinterpreted)

Prompts failing pilot are flagged for revision or removal. ~120 → ~110.
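The sampling step and the zero-overlap reject criterion can be sketched as follows. Two assumptions are made here: "stratified" is implemented as a round-robin draw across prompt types, and "zero overlap" is read as no brand being named by more than one model. Brand extraction from model output is out of scope, and the function names are hypothetical.

```python
import random
from collections import Counter


def stratified_sample(
    prompts_by_type: dict[str, list[str]], k: int = 10, seed: int = 0
) -> list[str]:
    """Draw k prompts round-robin across types so each type is represented."""
    rng = random.Random(seed)
    # Shuffled copies, so the caller's lists are left untouched.
    pools = {t: rng.sample(ps, len(ps)) for t, ps in prompts_by_type.items()}
    picked: list[str] = []
    while len(picked) < k and any(pools.values()):
        for prompt_type in pools:
            if pools[prompt_type] and len(picked) < k:
                picked.append(pools[prompt_type].pop())
    return picked


def has_zero_overlap(brands_by_model: dict[str, set[str]]) -> bool:
    """True if no brand appears in more than one model's output."""
    counts = Counter(b for brands in brands_by_model.values() for b in brands)
    return all(c == 1 for c in counts.values())
```

Note that `has_zero_overlap` also returns True when no model names any brand, which matches the first reject criterion (no actionable brand mentions).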

Stage 6 — Human Approval

The founder + category expert review the final ~110 candidates and select the production 100.

The founder always reviews. For categories outside the founder's domain knowledge, a paid expert reviewer (1–2 hours, $50–100) is engaged:

| Category | Expert profile |
|---|---|
| Kosmetyki naturalne (natural cosmetics) | Beauty product manager / freelance marketer |
| Suplementy / nutricosmetyki (supplements / nutricosmetics) | Nutritionist / DTC supplement marketer |
| Diety pudełkowe (boxed diet catering) | Fitness coach / dietitian |
| Premium pet food | Pet specialty store owner / dog trainer |
| Kawa specialty (specialty coffee) | Coffee blogger / barista trainer |
| Czekolada rzemieślnicza (craft chocolate) | Food blogger / chocolate-focused content creator |
| Kursy programowania (programming courses) | Bootcamp graduate / hiring manager |
| Kliniki estetyczne (aesthetic clinics) | Dermatologist or aesthetic medicine consultant |
| Fitness studios | Personal trainer / gym manager |
| Kosmetyki męskie (men's cosmetics) | Men's grooming influencer / DTC marketer |
| Świece sojowe (soy candles) | Founder + JAKULO customer service data |

The final 100 prompts are committed to the closed prompts/{slug}/ directory (gitignored). A public example framework is committed to prompts/example-{slug}.md (this repo), showing the structure and 5–10 illustrative examples per type but not the exact production strings.

Quarterly refresh — 20% rotation

Every quarter, the curation pipeline runs in refresh mode:

  1. Trend check: Google Trends API. Which prompts have lost relative search volume?
  2. New patterns: Reddit/Quora scrape. What new question patterns have emerged?
  3. New entrants: scan model outputs from the past quarter. What brands appeared in answers but aren't in our brand catalog?
  4. Generate replacements: Stages 1–5 for the rotation set
  5. Human approval: founder reviews the proposed 20 swaps in 5–10 minutes

This guards against Goodhart's Law: as the prompt pool becomes known to vendors (through reverse-engineering or leaks), rotating 20% of it per quarter ensures vendors can't permanently optimize against our exact queries.
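One way to pick the rotation set is to retire the prompts whose relative search volume fell the most. This is only a sketch of that idea: the text names the trend check as an input but does not say the selection is purely volume-based, so the ranking rule, the function name `pick_rotation_set`, and the input format (prompt mapped to relative change, where -0.4 means a 40% drop) are all assumptions.

```python
def pick_rotation_set(volume_change: dict[str, float], rotation_pct: int = 20) -> list[str]:
    """Return the `rotation_pct` percent of prompts with the largest volume decline."""
    n = len(volume_change) * rotation_pct // 100
    # Ascending sort puts the most negative change (biggest decline) first.
    ranked = sorted(volume_change, key=volume_change.get)
    return ranked[:n]
```

With a 100-prompt pool and the default 20% this returns 20 prompts, matching the 20 swaps the founder reviews each quarter.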

Cost per category

| Stage | API cost | Human cost |
|---|---|---|
| 1: Persona Generator | ~$0.50 (Claude) | |
| 2: Prompt Brainstormer | ~$1.50 (Claude) | |
| 3: Reality Check | $0 (free APIs) | |
| 4: Multi-agent Validation | ~$3 (Claude × 3 critics) | |
| 5: Pilot Test Run | ~$5 (10 prompts × 3 models = 30 outputs) | |
| 6: Human Approval | | ~30 min founder + 1–2h expert ($50–100 for non-founder categories) |
| Total per category | ~$10 | ~30 min + $50–100 for expert categories |

For 11 pilot categories: ~$110 API + ~5 hours founder time + ~$500 expert reviewers.
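The pilot totals follow from the per-category figures by simple multiplication. The sketch below reproduces that arithmetic; the variable names are illustrative, and the count of 10 expert-reviewed categories is an assumption based on the expert table, where only Świece sojowe is covered by the founder alone.

```python
# Per-stage API cost in USD, from the cost table (Stage 6 has no API cost).
API_COST_USD = {1: 0.50, 2: 1.50, 3: 0.0, 4: 3.0, 5: 5.0}

CATEGORIES = 11
EXPERT_CATEGORIES = 10          # assumption: all pilot categories except Świece sojowe
FOUNDER_MIN_PER_CATEGORY = 30
EXPERT_FEE_USD = (50, 100)      # low and high end per expert-reviewed category

per_category_api = sum(API_COST_USD.values())                    # 10.0
pilot_api = per_category_api * CATEGORIES                        # 110.0
founder_hours = FOUNDER_MIN_PER_CATEGORY * CATEGORIES / 60       # 5.5
expert_range = tuple(fee * EXPERT_CATEGORIES for fee in EXPERT_FEE_USD)  # (500, 1000)

print(f"API: ${pilot_api:.0f}, founder: {founder_hours} h, "
      f"experts: ${expert_range[0]}-{expert_range[1]}")
```

Under these assumptions the API total matches the ~$110 quoted above, while the founder time comes to 5.5 hours and the expert budget to $500–1,000, so the quoted ~5 hours and ~$500 read as rounded, low-end estimates.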

Quarterly refresh cost

Per category per quarter: ~$3 API + 5 minutes founder review.

For 11 categories: ~$35 API + 1 hour founder time per quarter.

Why this is published openly

We publish the process because the integrity of the ranking depends on the integrity of the prompts, and external review of the process is the strongest defense against the "your prompts are garbage" attack.

We do NOT publish the exact strings because of Goodhart's Law: known prompts get optimized against and stop measuring organic AI search behavior.

The boundary between "open process" and "closed strings" is itself documented openly.