citee-methodology/data/README.md
Jacek Kubas 03a397343e Faza 1: brand catalog (świece sojowe PL) + prompt curation pipeline
DATA — Public reference datasets for methodology:
- data/README.md: schema + format definitions for brand catalogs
- data/swiece-sojowe-pl/brand_catalog.json: 35 tracked brands (33 manufacturers + 2 importers) + 5 excluded marketplaces/resellers
- data/swiece-sojowe-pl/brand_catalog.md: human-readable companion
- data/swiece-sojowe-pl/market_metadata.json: GMV estimate, personas, seasonality, expected dynamics

TOOLS — 6-stage prompt curation pipeline (Python 3.12+):
- tools/prompt_curation/README.md: process documentation + cost estimates
- tools/prompt_curation/config.py: tunable parameters per stage
- tools/prompt_curation/.env.example: required API keys template
- tools/prompt_curation/requirements.txt: dependencies
- tools/prompt_curation/1_persona_generator.py: Claude generates 7 buyer personas
- tools/prompt_curation/2_prompt_brainstormer.py: per persona × 30 prompts in voice
- tools/prompt_curation/3_reality_checker.py: Google Trends + Reddit cross-check
- tools/prompt_curation/4_validation_agents.py: 3 critic agents async (real_buyer/methodology/exploit_hunter)
- tools/prompt_curation/5_pilot_test_runner.py: sample × 3 LLM models pre-flight
- tools/prompt_curation/6_human_review_export.py: CSV export for founder approval
- tools/prompt_curation/7_finalize.py: post-approval → closed prompts/{cat}/v{N}.json
- tools/prompt_curation/pipeline.py: orchestrator (stages 1–6, then human review, then 7)

GITIGNORE — Fixed .env.* exclusion to allow .env.example.

This commit completes Faza 1. Stage outputs (data/{cat}/personas.json,
raw_prompts.json, validated_prompts.json, critic_review.json, pilot_test_results.json,
for_human_review.csv) are runtime artifacts — public when committed, derived from
public methodology + public brand catalog. Final approved prompt strings in
prompts/{cat}/v{N}.json remain CLOSED (gitignored, anti-Goodhart's Law).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:40:12 +02:00


Citee Index Data

Public datasets used by the Citee Index methodology: per-category brand catalogs, market metadata, and model weight calibration sources.

This directory is public — anything here is part of the open methodology. Closed operational data (exact prompts, anti-gaming thresholds, scan outputs) lives elsewhere (gitignored or in separate access-controlled storage).


Structure

data/
├── README.md                          (this file)
├── {category-slug}/
│   ├── brand_catalog.json             # Brands tracked, normalized names, aliases, type
│   ├── brand_catalog.md               # Human-readable companion to JSON
│   └── market_metadata.json           # Market depth, GMV estimate, seasonality flags
├── model_weights/
│   └── pl-2026-q2.json                # Quarterly weight calibration with sources
└── shared/
    └── prompt_type_definitions.md     # Detailed definitions of 5 prompt types

Brand catalog schema

Each brand_catalog.json follows this schema:

{
  "category": "swiece-sojowe-pl",
  "country": "PL",
  "version": "1.0.0",
  "last_updated": "2026-05-03",
  "brands": [
    {
      "id": "jakulo",
      "name": "JAKULO",
      "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
      "domain": "jakulo.pl",
      "type": "manufacturer",
      "country_origin": "PL",
      "segment": "premium-handmade",
      "founded": 2022,
      "active_in_category_since": 2022,
      "notes": "Soy candles, handmade, Łódź-based"
    }
  ]
}
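
A minimal sketch of reading a catalog in this shape, assuming Python (the curation tools target Python 3.12+). The function name `load_brands` is hypothetical, and the inline JSON is an abbreviated stand-in for a `brand_catalog.json` file on disk.

```python
import json

# Abbreviated stand-in for data/{category-slug}/brand_catalog.json.
CATALOG_JSON = """
{
  "category": "swiece-sojowe-pl",
  "country": "PL",
  "version": "1.0.0",
  "last_updated": "2026-05-03",
  "brands": [
    {"id": "jakulo", "name": "JAKULO",
     "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
     "domain": "jakulo.pl", "type": "manufacturer"}
  ]
}
"""


def load_brands(raw: str) -> dict[str, dict]:
    """Parse catalog JSON and index brand entries by their `id` (hypothetical helper)."""
    catalog = json.loads(raw)
    index: dict[str, dict] = {}
    for brand in catalog["brands"]:
        # `id` is the primary key, so a duplicate is treated as an error here.
        if brand["id"] in index:
            raise ValueError(f"duplicate brand id: {brand['id']}")
        index[brand["id"]] = brand
    return index


brands = load_brands(CATALOG_JSON)
print(brands["jakulo"]["name"])  # JAKULO
```

Indexing by `id` mirrors its role as the primary key in scan outputs; rejecting duplicates is a design choice of this sketch, not a documented rule.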

Field definitions

  • id: unique slug (lowercase, hyphenated). Used as primary key in scan outputs.
  • name: canonical display name (mixed case as brand presents itself).
  • aliases: all variations to detect in LLM outputs (case-insensitive matching during scan).
  • domain: primary website. Used for citation depth scoring (direct link to brand.com vs mention only).
  • type: one of manufacturer (own products), importer (foreign brand sold in the country), or reseller (multi-brand retailer).
  • country_origin: ISO 3166-1 alpha-2 code of the brand's country of origin. The PL ranking includes both Polish brands (PL) and foreign brands actively sold on the Polish market.
  • segment: one of premium-handmade, premium, mid, budget, mass-market. A subjective categorization used for cross-cutting reports.
  • founded: year, if known.
  • active_in_category_since: year brand started selling in this specific category (may differ from founding if pivoted).
  • notes: free-text human-readable context.
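
The aliases and domain fields above could be used along these lines. This is only a sketch under stated assumptions: the function name `find_mention` and the three labels (`link`, `mention`, `none`) are hypothetical, and the actual scan logic and scoring are not part of this directory.

```python
import re


def find_mention(text: str, brand: dict) -> str:
    """Classify one LLM answer for one brand: 'link', 'mention', or 'none'.

    Labels are illustrative only; they are not the Citee scoring scale.
    """
    lowered = text.lower()  # case-insensitive matching, as the aliases field requires

    # A direct link: the brand's domain appears inside a URL.
    domain = re.escape(brand["domain"].lower())
    if re.search(r"https?://(?:www\.)?" + domain, lowered):
        return "link"

    for alias in brand["aliases"]:
        # Lookarounds keep an alias from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(alias.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            return "mention"
    return "none"


jakulo = {
    "id": "jakulo",
    "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
    "domain": "jakulo.pl",
}
print(find_mention("Polecam świece JAKULO z Łodzi.", jakulo))   # mention
print(find_mention("See https://jakulo.pl/sklep", jakulo))      # link
print(find_mention("No brands named here.", jakulo))            # none
```

Because matching is case-insensitive, aliases that differ only in case are redundant for detection, though listing them keeps the catalog readable.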

Type policy

  • Manufacturers are the primary scoring targets — these are the brands that benefit most from AI visibility.
  • Importers are included if they have a meaningful PL market presence (e.g., Yankee Candle PL, which imports and sells through its own channels). They are marked type: importer.
  • Resellers (Notino, Sephora, Empik) are tracked as mention-only: they appear in AI answers but have no proprietary brand identity in this category. They are stored separately in resellers.json and are not ranked.
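
As a rough illustration of the policy above, entities can be split by type into ranked and mention-only groups. The helper name `split_by_ranking` and the sample ids are hypothetical, and the sketch assumes a single mixed list even though resellers are kept in a separate file.

```python
RANKED_TYPES = {"manufacturer", "importer"}
MENTION_ONLY_TYPES = {"reseller"}


def split_by_ranking(entities: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (ranked, mention_only) according to each entity's `type`."""
    ranked: list[dict] = []
    mention_only: list[dict] = []
    for entity in entities:
        if entity["type"] in RANKED_TYPES:
            ranked.append(entity)
        elif entity["type"] in MENTION_ONLY_TYPES:
            mention_only.append(entity)
        else:
            # Unknown types are rejected rather than silently dropped.
            raise ValueError(f"unknown type for {entity['id']}: {entity['type']}")
    return ranked, mention_only


# Sample entities; ids are made up for illustration.
entities = [
    {"id": "jakulo", "type": "manufacturer"},
    {"id": "yankee-candle", "type": "importer"},
    {"id": "notino", "type": "reseller"},
]
ranked, mention_only = split_by_ranking(entities)
print([e["id"] for e in ranked])        # ['jakulo', 'yankee-candle']
print([e["id"] for e in mention_only])  # ['notino']
```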

Excluded entities

The following are tracked as mentions but explicitly excluded from ranking:

  • Marketplaces (Allegro, Empik, Ceneo) — not brands, just sales channels
  • Generic categories (any "świece sojowe" mentions without brand attribution)
  • Honeypot brand (a fictional brand inserted by Citee; see methodology.json for the policy. Its exact identity is not published.)

Adding a new brand

When a new brand appears in scan outputs (detected via Stage 4 of the curation pipeline or manually), add it to brand_catalog.json with at minimum: id, name, aliases, domain. The other fields can be filled in over time.

Steps:

  1. Edit brand_catalog.json for the relevant category
  2. Bump version (1.0.0 → 1.0.1 for additions, 1.1.0 if a methodology change accompanies the addition)
  3. Update last_updated
  4. Commit with message like: data: add Brand X to swiece-sojowe-pl catalog (detected in Q2 2026 scan)
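
Steps 1 to 3 can be sketched as a single function, assuming the catalog is already loaded as a dict. The name `add_brand`, the slug pattern, and the sample brand are hypothetical; the sketch covers only the plain-addition case (a patch bump).

```python
import re
from datetime import date

REQUIRED_FIELDS = ("id", "name", "aliases", "domain")
# Assumed reading of "lowercase, hyphenated": lowercase alphanumerics joined by hyphens.
SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def add_brand(catalog: dict, brand: dict, today: date) -> dict:
    """Return a new catalog dict with `brand` appended and metadata updated."""
    missing = [f for f in REQUIRED_FIELDS if not brand.get(f)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not SLUG.fullmatch(brand["id"]):
        raise ValueError(f"id is not a lowercase hyphenated slug: {brand['id']}")
    if any(b["id"] == brand["id"] for b in catalog["brands"]):
        raise ValueError(f"brand already in catalog: {brand['id']}")

    major, minor, patch = (int(part) for part in catalog["version"].split("."))
    return {
        **catalog,
        "brands": [*catalog["brands"], brand],
        "version": f"{major}.{minor}.{patch + 1}",  # addition -> patch bump
        "last_updated": today.isoformat(),
    }


catalog = {
    "category": "swiece-sojowe-pl",
    "version": "1.0.0",
    "last_updated": "2026-05-03",
    "brands": [{"id": "jakulo", "name": "JAKULO",
                "aliases": ["Jakulo"], "domain": "jakulo.pl"}],
}
new_brand = {"id": "brand-x", "name": "Brand X",
             "aliases": ["Brand X"], "domain": "brand-x.example"}
updated = add_brand(catalog, new_brand, date(2026, 6, 1))
print(updated["version"], updated["last_updated"])  # 1.0.1 2026-06-01
```

Returning a new dict instead of mutating the input keeps the original catalog available for a diff before committing.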

Versioning

Brand catalog updates do NOT trigger methodology version bumps (they're data, not formula). They follow their own semver:

  • PATCH (1.0.1) — adding/removing brands, updating aliases
  • MINOR (1.1.0) — schema changes (new fields), category restructuring
  • MAJOR (2.0.0) — incompatible structural changes
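
The rules above can be expressed as a lookup from change kind to bump level, with the highest level winning when several changes land together. The change-kind names and the function `next_version` are hypothetical labels chosen for this sketch.

```python
# Hypothetical change-kind labels mapped to the bump levels listed above.
CHANGE_LEVEL = {
    "add_brand": "patch",
    "remove_brand": "patch",
    "update_aliases": "patch",
    "schema_change": "minor",
    "category_restructuring": "minor",
    "incompatible_structure": "major",
}
LEVELS = ("patch", "minor", "major")  # ordered from lowest to highest


def next_version(version: str, changes: list[str]) -> str:
    """Compute the next catalog version for a set of changes."""
    if not changes:
        raise ValueError("no changes given")
    # An unknown change kind raises KeyError instead of being guessed at.
    level = max((CHANGE_LEVEL[c] for c in changes), key=LEVELS.index)
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("1.0.0", ["add_brand"]))                        # 1.0.1
print(next_version("1.0.1", ["update_aliases", "schema_change"]))  # 1.1.0
print(next_version("1.4.2", ["incompatible_structure"]))           # 2.0.0
```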