# Citee Index Data

> Public datasets used by Citee Index methodology. Brand catalogs per category, market metadata, model weight calibration sources.

This directory is **public**: anything here is part of the open methodology. Closed operational data (exact prompts, anti-gaming thresholds, scan outputs) lives elsewhere (gitignored or in separate access-controlled storage).

---

## Structure

```
data/
├── README.md                          (this file)
├── {category-slug}/
│   ├── brand_catalog.json             # Brands tracked, normalized names, aliases, type
│   ├── brand_catalog.md               # Human-readable companion to JSON
│   └── market_metadata.json           # Market depth, GMV estimate, seasonality flags
├── model_weights/
│   └── pl-2026-q2.json                # Quarterly weight calibration with sources
└── shared/
    └── prompt_type_definitions.md     # Detailed definitions of 5 prompt types
```

## Brand catalog schema

Each `brand_catalog.json` follows this schema:

```json
{
  "category": "swiece-sojowe-pl",
  "country": "PL",
  "version": "1.0.0",
  "last_updated": "2026-05-03",
  "brands": [
    {
      "id": "jakulo",
      "name": "JAKULO",
      "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
      "domain": "jakulo.pl",
      "type": "manufacturer",
      "country_origin": "PL",
      "segment": "premium-handmade",
      "founded": 2022,
      "active_in_category_since": 2022,
      "notes": "Soy candles, handmade, Łódź-based"
    }
  ]
}
```

### Field definitions

- **id:** unique slug (lowercase, hyphenated). Used as primary key in scan outputs.
- **name:** canonical display name (mixed case as brand presents itself).
- **aliases:** all variations to detect in LLM outputs (case-insensitive matching during scan).
- **domain:** primary website. Used for citation depth scoring (direct link to brand.com vs mention only).
- **type:** `manufacturer` (own products), `importer` (foreign brand sold in country), `reseller` (multi-brand retailer).
- **country_origin:** ISO 3166-1 alpha-2. For PL ranking, includes both `PL` (Polish brands) and foreign brands actively sold in PL market.
- **segment:** `premium-handmade`, `premium`, `mid`, `budget`, `mass-market`. Subjective categorization, used for cross-cutting reports.
- **founded:** year, if known.
- **active_in_category_since:** year the brand started selling in this specific category (may differ from the founding year if the brand pivoted).
- **notes:** free-text human-readable context.

### Type policy

- **Manufacturers** are the primary scoring targets; these are the brands that benefit most from AI visibility.
- **Importers** are included if they have meaningful PL market presence (e.g., Yankee Candle PL imports and sells through its own channels). Marked `type: importer`.
- **Resellers** (Notino, Sephora, Empik) are tracked as **mention-only**: they appear in AI answers but don't have a proprietary brand identity in this category. Stored separately in `resellers.json` and not ranked.

### Excluded entities

The following are tracked as mentions but explicitly excluded from ranking:

- **Marketplaces** (Allegro, Empik, Ceneo): not brands, just sales channels
- **Generic categories** (any "świece sojowe" mentions without brand attribution)
- **Honeypot brand** (fictional brand inserted by Citee; see `methodology.json` for policy, exact identity closed)

## Adding a new brand

When a new brand appears in scan outputs (detected via Stage 4 of the curation pipeline or manually), it should be added to `brand_catalog.json` with at minimum: `id`, `name`, `aliases`, `domain`. Other fields can be filled in over time.

Adding a brand:

1. Edit `brand_catalog.json` for the relevant category
2. Bump version (1.0.0 → 1.0.1 for additions, 1.1.0 if a methodology change accompanies it)
3. Update `last_updated`
4. Commit with a message like: `data: add Brand X to swiece-sojowe-pl catalog (detected in Q2 2026 scan)`

## Versioning

Brand catalog updates do NOT trigger methodology version bumps (they're data, not formula).
They follow their own semver:

- **PATCH** (1.0.1): adding/removing brands, updating aliases
- **MINOR** (1.1.0): schema changes (new fields), category restructuring
- **MAJOR** (2.0.0): incompatible structural changes
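
The "Adding a new brand" steps and the PATCH rule can be sketched in Python as below. This is a minimal illustration, not part of the Citee tooling: the helper names `bump_patch` and `add_brand`, the slug regex (which assumes digits are allowed in an `id`), and the sample brand `example-brand` are all hypothetical.

```python
import copy
import re
from datetime import date

# Minimum fields required when a brand is first added (per "Adding a new brand").
REQUIRED_FIELDS = ("id", "name", "aliases", "domain")

# Assumption: "lowercase, hyphenated" is read as lowercase alphanumerics joined by "-".
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def bump_patch(version: str) -> str:
    """Increment the PATCH component of a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def add_brand(catalog: dict, brand: dict, today: date) -> dict:
    """Return a copy of the catalog with the brand appended and metadata updated."""
    missing = [field for field in REQUIRED_FIELDS if not brand.get(field)]
    if missing:
        raise ValueError(f"brand is missing required fields: {missing}")
    if not SLUG_RE.match(brand["id"]):
        raise ValueError(f"id must be a lowercase hyphenated slug: {brand['id']!r}")
    if any(existing["id"] == brand["id"] for existing in catalog["brands"]):
        raise ValueError(f"duplicate brand id: {brand['id']!r}")

    updated = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    updated["brands"].append(brand)
    updated["version"] = bump_patch(catalog["version"])  # additions are PATCH bumps
    updated["last_updated"] = today.isoformat()
    return updated


# Trimmed-down catalog using the values from the schema example.
catalog = {
    "category": "swiece-sojowe-pl",
    "country": "PL",
    "version": "1.0.0",
    "last_updated": "2026-05-03",
    "brands": [
        {
            "id": "jakulo",
            "name": "JAKULO",
            "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
            "domain": "jakulo.pl",
        }
    ],
}

# Hypothetical brand, for illustration only.
new_brand = {
    "id": "example-brand",
    "name": "Example Brand",
    "aliases": ["Example Brand", "example-brand"],
    "domain": "example.com",
}

new_catalog = add_brand(catalog, new_brand, date(2026, 5, 10))
print(new_catalog["version"], new_catalog["last_updated"])
```

Returning a copy rather than mutating in place keeps the original catalog available for diffing before the commit. A MINOR or MAJOR bump would still be a manual decision, since it depends on whether the schema or structure changed.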