DATA — Public reference datasets for methodology:
- data/README.md: schema + format definitions for brand catalogs
- data/swiece-sojowe-pl/brand_catalog.json: 35 tracked brands (33 manufacturers + 2 importers) + 5 excluded marketplaces/resellers
- data/swiece-sojowe-pl/brand_catalog.md: human-readable companion
- data/swiece-sojowe-pl/market_metadata.json: GMV estimate, personas, seasonality, expected dynamics
TOOLS — 6-stage prompt curation pipeline (Python 3.12+):
- tools/prompt_curation/README.md: process documentation + cost estimates
- tools/prompt_curation/config.py: tunable parameters per stage
- tools/prompt_curation/.env.example: required API keys template
- tools/prompt_curation/requirements.txt: dependencies
- tools/prompt_curation/1_persona_generator.py: Claude generates 7 buyer personas
- tools/prompt_curation/2_prompt_brainstormer.py: 30 prompts per persona, written in the persona's voice
- tools/prompt_curation/3_reality_checker.py: Google Trends + Reddit cross-check
- tools/prompt_curation/4_validation_agents.py: 3 critic agents async (real_buyer/methodology/exploit_hunter)
- tools/prompt_curation/5_pilot_test_runner.py: sample × 3 LLM models pre-flight
- tools/prompt_curation/6_human_review_export.py: CSV export for founder approval
- tools/prompt_curation/7_finalize.py: post-approval → closed prompts/{cat}/v{N}.json
- tools/prompt_curation/pipeline.py: orchestrator (stages 1–6, then human review, then 7)
GITIGNORE — Fixed .env.* exclusion to allow .env.example.
This commit completes Faza 1 (Phase 1). Stage outputs (data/{cat}/personas.json,
raw_prompts.json, validated_prompts.json, critic_review.json, pilot_test_results.json,
for_human_review.csv) are runtime artifacts: public when committed, derived from
public methodology + public brand catalog. Final approved prompt strings in
prompts/{cat}/v{N}.json remain CLOSED (gitignored, to guard against Goodhart's Law).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
92 lines
4.2 KiB
Markdown
# Citee Index Data

> Public datasets used by Citee Index methodology. Brand catalogs per category, market metadata, model weight calibration sources.

This directory is **public**: anything here is part of the open methodology. Closed operational data (exact prompts, anti-gaming thresholds, scan outputs) lives elsewhere (gitignored or in separate access-controlled storage).

---
## Structure
```
data/
├── README.md                         (this file)
├── {category-slug}/
│   ├── brand_catalog.json            # Brands tracked, normalized names, aliases, type
│   ├── brand_catalog.md              # Human-readable companion to JSON
│   └── market_metadata.json          # Market depth, GMV estimate, seasonality flags
├── model_weights/
│   └── pl-2026-q2.json               # Quarterly weight calibration with sources
└── shared/
    └── prompt_type_definitions.md    # Detailed definitions of 5 prompt types
```
## Brand catalog schema
Each `brand_catalog.json` follows this schema:
```json
{
  "category": "swiece-sojowe-pl",
  "country": "PL",
  "version": "1.0.0",
  "last_updated": "2026-05-03",
  "brands": [
    {
      "id": "jakulo",
      "name": "JAKULO",
      "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
      "domain": "jakulo.pl",
      "type": "manufacturer",
      "country_origin": "PL",
      "segment": "premium-handmade",
      "founded": 2022,
      "active_in_category_since": 2022,
      "notes": "Soy candles, handmade, Łódź-based"
    }
  ]
}
```
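An entry in this shape can be sanity-checked before it is committed. The following is a minimal sketch, not part of the repository: the function name `validate_brand`, the slug pattern, and the choice of required fields are assumptions drawn from the field definitions and the "Adding a new brand" section of this README.

```python
import re

# Allowed values as listed under "Field definitions"; treat them as assumptions
# if the real catalogs use additional values.
ALLOWED_TYPES = {"manufacturer", "importer", "reseller"}
ALLOWED_SEGMENTS = {"premium-handmade", "premium", "mid", "budget", "mass-market"}
REQUIRED_FIELDS = ("id", "name", "aliases", "domain")
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")  # lowercase, hyphenated


def validate_brand(brand: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not brand.get(field):
            problems.append(f"missing required field: {field}")
    brand_id = brand.get("id")
    if brand_id and not (isinstance(brand_id, str) and SLUG_RE.fullmatch(brand_id)):
        problems.append(f"id is not a lowercase hyphenated slug: {brand_id!r}")
    if "type" in brand and brand["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {brand['type']!r}")
    if "segment" in brand and brand["segment"] not in ALLOWED_SEGMENTS:
        problems.append(f"unknown segment: {brand['segment']!r}")
    origin = brand.get("country_origin")
    if origin is not None and not (isinstance(origin, str) and re.fullmatch(r"[A-Z]{2}", origin)):
        problems.append(f"country_origin is not a two-letter code: {origin!r}")
    return problems
```

The check only verifies the shape of `country_origin` (two uppercase letters), not whether the code is actually assigned in ISO 3166-1.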
### Field definitions
- **id:** unique slug (lowercase, hyphenated). Used as primary key in scan outputs.
- **name:** canonical display name (mixed case as the brand presents itself).
- **aliases:** all variations to detect in LLM outputs (case-insensitive matching during scan).
- **domain:** primary website. Used for citation depth scoring (direct link to brand.com vs mention only).
- **type:** `manufacturer` (own products), `importer` (foreign brand sold in country), `reseller` (multi-brand retailer).
- **country_origin:** ISO 3166-1 alpha-2 code. The PL ranking includes both Polish brands (`PL`) and foreign brands actively sold in the PL market.
- **segment:** `premium-handmade`, `premium`, `mid`, `budget`, `mass-market`. Subjective categorization, used for cross-cutting reports.
- **founded:** year, if known.
- **active_in_category_since:** year the brand started selling in this specific category (may differ from `founded` if the brand pivoted).
- **notes:** free-text human-readable context.
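The `aliases` and `domain` fields suggest how a scan could detect brands in an LLM answer. Below is a rough sketch under stated assumptions, not the actual scan implementation: the function `find_mentions`, the word-boundary rule, and the `linked` flag (a stand-in for the "direct link vs mention only" distinction) are all hypothetical.

```python
import re


def find_mentions(text: str, brands: list[dict]) -> dict[str, dict]:
    """Map brand id -> {"mentioned": bool, "linked": bool} for brands found in text."""
    results = {}
    for brand in brands:
        # Match the canonical name and every alias, longest first, ignoring case.
        terms = sorted({brand["name"], *brand.get("aliases", [])}, key=len, reverse=True)
        pattern = "|".join(re.escape(term) for term in terms)
        # The lookarounds keep "jakulo" from matching inside a longer word.
        mentioned = re.search(rf"(?<!\w)(?:{pattern})(?!\w)", text, re.IGNORECASE) is not None
        # Assumption: a plain substring test on the domain approximates "direct link".
        domain = brand.get("domain")
        linked = bool(domain) and domain.lower() in text.lower()
        if mentioned or linked:
            results[brand["id"]] = {"mentioned": mentioned, "linked": linked}
    return results
```

Python's `\w` is Unicode-aware for `str` patterns, so Polish letters next to an alias count as word characters and prevent partial matches.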
### Type policy
- **Manufacturers** are the primary scoring targets, since these are the brands that benefit most from AI visibility.
- **Importers** are included if they have meaningful PL market presence (e.g., Yankee Candle PL, which imports and sells through its own channels). Marked `type: importer`.
- **Resellers** (Notino, Sephora, Empik) are tracked as **mention-only**: they appear in AI answers but have no proprietary brand identity in this category. Stored separately in `resellers.json` and not ranked.
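The type policy amounts to a partition of entries by their `type` field. A minimal sketch, assuming manufacturers and importers are the ranked types; the function name `split_by_policy` is hypothetical, and keeping entries with a missing or unknown `type` out of the ranking is an assumption rather than documented behavior.

```python
RANKED_TYPES = {"manufacturer", "importer"}


def split_by_policy(brands: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (ranked, mention_only) lists following the type policy."""
    ranked, mention_only = [], []
    for brand in brands:
        target = ranked if brand.get("type") in RANKED_TYPES else mention_only
        target.append(brand)
    return ranked, mention_only
```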
### Excluded entities
The following are tracked as mentions but explicitly excluded from ranking:
- **Marketplaces** (Allegro, Empik, Ceneo): not brands, just sales channels
- **Generic categories** (any "świece sojowe" mentions without brand attribution)
- **Honeypot brand** (fictional brand inserted by Citee; see `methodology.json` for policy, exact identity closed)
## Adding a new brand
When a new brand appears in scan outputs (detected via Stage 4 of the curation pipeline or manually), add it to `brand_catalog.json` with at minimum `id`, `name`, `aliases`, and `domain`. The other fields can be filled in over time.

Adding a brand:
1. Edit `brand_catalog.json` for the relevant category
2. Bump the version (1.0.0 → 1.0.1 for additions, 1.1.0 if a methodology change accompanies the addition)
3. Update `last_updated`
4. Commit with a message like: `data: add Brand X to swiece-sojowe-pl catalog (detected in Q2 2026 scan)`
## Versioning
Brand catalog updates do NOT trigger methodology version bumps (they're data, not formula). They follow their own semver:
- **PATCH** (1.0.1): adding/removing brands, updating aliases
- **MINOR** (1.1.0): schema changes (new fields), category restructuring
- **MAJOR** (2.0.0): incompatible structural changes
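The bump rules can be expressed as a lookup from kind of change to semver level. A minimal sketch: `bump_version` and the keys of `CHANGE_LEVEL` are hypothetical names chosen for illustration, and resetting lower components to zero follows standard semver practice rather than anything stated in this README.

```python
def bump_version(version: str, level: str) -> str:
    """Bump a MAJOR.MINOR.PATCH string at the given level, resetting lower parts to 0."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level!r}")


# Hypothetical mapping from kind of catalog change to semver level.
CHANGE_LEVEL = {
    "add_brand": "patch",
    "remove_brand": "patch",
    "update_aliases": "patch",
    "schema_change": "minor",
    "category_restructuring": "minor",
    "incompatible_change": "major",
}
```

For example, `bump_version("1.0.3", CHANGE_LEVEL["schema_change"])` yields `"1.1.0"`.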