# Citee Index Data

> Public datasets used by the Citee Index methodology: brand catalogs per category, market metadata, and model weight calibration sources.

This directory is **public** — anything here is part of the open methodology. Closed operational data (exact prompts, anti-gaming thresholds, scan outputs) lives elsewhere (gitignored or in separate access-controlled storage).

---

## Structure

```
data/
├── README.md                          (this file)
├── {category-slug}/
│   ├── brand_catalog.json             # Brands tracked, normalized names, aliases, type
│   ├── brand_catalog.md               # Human-readable companion to JSON
│   └── market_metadata.json           # Market depth, GMV estimate, seasonality flags
├── model_weights/
│   └── pl-2026-q2.json                # Quarterly weight calibration with sources
└── shared/
    └── prompt_type_definitions.md     # Detailed definitions of 5 prompt types
```

## Brand catalog schema

Each `brand_catalog.json` follows this schema:

```json
{
  "category": "swiece-sojowe-pl",
  "country": "PL",
  "version": "1.0.0",
  "last_updated": "2026-05-03",
  "brands": [
    {
      "id": "jakulo",
      "name": "JAKULO",
      "aliases": ["Jakulo", "jakulo", "jakulo.pl"],
      "domain": "jakulo.pl",
      "type": "manufacturer",
      "country_origin": "PL",
      "segment": "premium-handmade",
      "founded": 2022,
      "active_in_category_since": 2022,
      "notes": "Soy candles, handmade, Łódź-based"
    }
  ]
}
```
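
A catalog file in this shape can be sanity-checked before committing. The following is a minimal sketch, not the project's actual validator: the function name `validate_catalog` is hypothetical, the required brand fields are the minimum set named under "Adding a new brand", and the allowed `type` values are the three listed under "Field definitions".

```python
import re

# Minimum fields per brand, per the "Adding a new brand" section.
REQUIRED_BRAND_FIELDS = ("id", "name", "aliases", "domain")
# Allowed values for `type`, per the "Field definitions" section.
ALLOWED_TYPES = {"manufacturer", "importer", "reseller"}
# Assumption: "lowercase, hyphenated" means ASCII letters/digits joined by single hyphens.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def validate_catalog(catalog: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("category", "country", "version", "last_updated", "brands"):
        if key not in catalog:
            problems.append(f"missing top-level field: {key}")

    seen_ids = set()
    for i, brand in enumerate(catalog.get("brands", [])):
        label = brand.get("id", f"brands[{i}]")
        for field in REQUIRED_BRAND_FIELDS:
            if field not in brand:
                problems.append(f"{label}: missing required field {field}")
        brand_id = brand.get("id")
        if brand_id is not None:
            if not SLUG_RE.match(brand_id):
                problems.append(f"{label}: id is not a lowercase hyphenated slug")
            if brand_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(brand_id)
        if "type" in brand and brand["type"] not in ALLOWED_TYPES:
            problems.append(f"{label}: unknown type {brand['type']!r}")
    return problems
```

Running `validate_catalog(json.load(open(path)))` on a catalog file and failing the commit when the returned list is non-empty would be one way to use it.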

### Field definitions

- **id:** unique slug (lowercase, hyphenated). Used as primary key in scan outputs.
- **name:** canonical display name, using the casing the brand itself uses.
- **aliases:** all variations to detect in LLM outputs (case-insensitive matching during scan).
- **domain:** primary website. Used for citation depth scoring (direct link to brand.com vs mention only).
- **type:** `manufacturer` (own products), `importer` (foreign brand sold in country), `reseller` (multi-brand retailer).
- **country_origin:** ISO 3166-1 alpha-2 code of the brand's country of origin. The PL ranking covers both Polish brands (`PL`) and foreign brands actively sold in the PL market, so values other than `PL` are expected.
- **segment:** `premium-handmade`, `premium`, `mid`, `budget`, `mass-market`. Subjective categorization, used for cross-cutting reports.
- **founded:** year, if known.
- **active_in_category_since:** year the brand started selling in this specific category (may differ from `founded` if the brand pivoted).
- **notes:** free-text human-readable context.
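
To illustrate how `aliases` and `domain` might be used together during a scan, here is a minimal sketch. The function name `find_brand_mentions`, the word-boundary rule, and the boolean `domain_cited` flag are all assumptions made for illustration; the actual scan logic and citation depth scoring are not described in this directory.

```python
import re


def find_brand_mentions(text: str, brand: dict) -> dict:
    """Report whether a brand is mentioned in text and whether its domain appears."""
    # Try longer names first so "jakulo.pl" is preferred over "jakulo".
    names = sorted({brand["name"], *brand["aliases"]}, key=len, reverse=True)
    # Assumption: a name only counts when it is not embedded in a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,  # case-insensitive matching, as the alias definition requires
    )
    mentioned = pattern.search(text) is not None
    # Simplified stand-in for citation depth: does the primary domain appear at all?
    domain_cited = brand["domain"].lower() in text.lower()
    return {"id": brand["id"], "mentioned": mentioned, "domain_cited": domain_cited}
```

Because matching is case-insensitive, aliases that differ only in case are redundant for this sketch, though listing them does no harm.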

### Type policy

- **Manufacturers** are the primary scoring targets — these are the brands that benefit most from AI visibility.
- **Importers** are included if they have a meaningful PL market presence (e.g., Yankee Candle PL, which imports and sells through its own channels). Marked `type: importer`.
- **Resellers** (Notino, Sephora, Empik) are tracked as **mention-only**: they appear in AI answers but don't have a proprietary brand identity in this category. They are stored separately in `resellers.json` and are not ranked.

### Excluded entities

The following are tracked as mentions but explicitly excluded from ranking:
- **Marketplaces** (Allegro, Empik, Ceneo) — not brands, just sales channels
- **Generic categories** (any "świece sojowe" mentions without brand attribution)
- **Honeypot brand** (fictional brand inserted by Citee — see `methodology.json` for policy, exact identity closed)
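
The type policy and exclusions above amount to a filter over catalog entries. A minimal sketch, assuming a hypothetical helper `split_for_ranking` and a hypothetical `excluded_ids` parameter for any entries that must be held out of the ranking; treating entries without a `type` as mention-only until they are classified is also an assumption:

```python
# Types that are scored, per the type policy.
RANKED_TYPES = {"manufacturer", "importer"}


def split_for_ranking(brands: list, excluded_ids: frozenset = frozenset()) -> tuple:
    """Split entries into (ranked, mention_only) lists, preserving input order."""
    ranked, mention_only = [], []
    for brand in brands:
        if brand.get("type") in RANKED_TYPES and brand["id"] not in excluded_ids:
            ranked.append(brand)
        else:
            # Resellers, explicitly excluded ids, and unclassified entries land here.
            mention_only.append(brand)
    return ranked, mention_only
```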

## Adding a new brand

When a new brand appears in scan outputs (detected via Stage 4 of the curation pipeline or manually), add it to `brand_catalog.json` with at minimum `id`, `name`, `aliases`, and `domain`. The other fields can be filled in over time.

Adding a brand:
1. Edit `brand_catalog.json` for the relevant category
2. Bump `version` (1.0.0 → 1.0.1 for additions; 1.1.0 if a methodology change accompanies the addition)
3. Update `last_updated`
4. Commit with message like: `data: add Brand X to swiece-sojowe-pl catalog (detected in Q2 2026 scan)`
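
Steps 1 to 3 can be scripted. The sketch below is illustrative only: `add_brand` is a hypothetical helper, it handles the plain-addition case (PATCH bump) and leaves reading and writing the JSON file, and the commit itself, to the caller.

```python
import copy
from datetime import date
from typing import Optional


def add_brand(catalog: dict, brand: dict, today: Optional[date] = None) -> dict:
    """Return a copy of the catalog with the brand added and metadata updated."""
    missing = [f for f in ("id", "name", "aliases", "domain") if f not in brand]
    if missing:
        raise ValueError(f"brand is missing required fields: {missing}")
    if any(existing["id"] == brand["id"] for existing in catalog["brands"]):
        raise ValueError(f"brand id already in catalog: {brand['id']}")

    updated = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    updated["brands"].append(brand)
    # Step 2: a plain addition is a PATCH bump.
    major, minor, patch = (int(part) for part in updated["version"].split("."))
    updated["version"] = f"{major}.{minor}.{patch + 1}"
    # Step 3: refresh last_updated as an ISO date (YYYY-MM-DD).
    updated["last_updated"] = (today or date.today()).isoformat()
    return updated
```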

## Versioning

Brand catalog updates do NOT trigger methodology version bumps (they're data, not formula). They follow their own semver:
- **PATCH** (1.0.1) — adding/removing brands, updating aliases
- **MINOR** (1.1.0) — schema changes (new fields), category restructuring
- **MAJOR** (2.0.0) — incompatible structural changes
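
The three bump levels map onto a small helper. This is a sketch with a hypothetical function name, assuming versions are always plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes:

```python
def bump_version(version: str, level: str) -> str:
    """Bump a MAJOR.MINOR.PATCH string at the given level, resetting lower parts."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "patch":  # adding/removing brands, updating aliases
        return f"{major}.{minor}.{patch + 1}"
    if level == "minor":  # schema changes, category restructuring
        return f"{major}.{minor + 1}.0"
    if level == "major":  # incompatible structural changes
        return f"{major + 1}.0.0"
    raise ValueError(f"unknown bump level: {level!r}")
```

For example, `bump_version("1.0.0", "patch")` gives `"1.0.1"` and `bump_version("1.0.1", "minor")` gives `"1.1.0"`.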