Kaomojis Methodology — How We Curate, Classify & QA

This page explains, in engineering terms, how Kaomojis curates, classifies, and maintains quality for 55,000+ kaomoji. Rather than maintaining a simple copy-paste list, we run data design, deduplication, multilingual annotation, ranking, and translation QA as one integrated pipeline.

1. Ingestion

The primary sources are public kaomoji that have long circulated on Japanese BBSes and social media. We also add originals created by our editors and user submissions received via /request, each carefully reviewed before inclusion.

An automated screening step at ingest rejects content that depicts existing fictional characters, is trademarked, or is otherwise copyrighted.

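All strings undergo Unicode NFC normalisation so that precomposed and decomposed forms of the same character are unified. Half-width and full-width variants are compatibility equivalents, which NFC leaves distinct; folding them requires the compatibility form NFKC.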
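As a minimal sketch of what the two standard normalisation forms do, the TypeScript below uses the built-in `String.prototype.normalize`. The helper names are hypothetical, and which form the pipeline applies to which field is an assumption rather than something this page specifies.

```typescript
// Hypothetical helper: NFC merges a base letter and its combining marks
// into the precomposed character and leaves width variants untouched.
function toNfc(text: string): string {
  return text.normalize("NFC");
}

// Hypothetical helper: NFKC additionally folds compatibility variants,
// including half-width and full-width forms.
function toNfkc(text: string): string {
  return text.normalize("NFKC");
}

const composed = toNfc("e\u0301");    // "e" + COMBINING ACUTE ACCENT -> U+00E9
const keptWidth = toNfc("\uFF71");    // HALFWIDTH KATAKANA LETTER A stays U+FF71
const foldedWidth = toNfkc("\uFF71"); // becomes U+30A2 KATAKANA LETTER A
```

NFKC also rewrites full-width punctuation and half-width katakana that many kaomoji use for visual effect, so one plausible design is to keep the display text in NFC and apply NFKC only to a comparison key.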

2. Deduplication

The text of every kaomoji is SHA-256 hashed, and a UNIQUE constraint on that hash at the database layer rejects exact duplicates.
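The exact-duplicate check can be sketched as follows, using Node's built-in `node:crypto` module. The function name `kaomojiHash`, the table layout, and the column names are assumptions for illustration; the real schema is not shown on this page.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex-encoded SHA-256 of the NFC-normalised text,
// so canonically equivalent strings map to the same key.
function kaomojiHash(text: string): string {
  return createHash("sha256").update(text.normalize("NFC"), "utf8").digest("hex");
}

// Hypothetical schema: the UNIQUE column makes SQLite reject a second
// row with the same hash at insert time.
const schemaSql = `
CREATE TABLE IF NOT EXISTS kaomoji (
  id        INTEGER PRIMARY KEY,
  text      TEXT NOT NULL,
  text_hash TEXT NOT NULL UNIQUE
);`;
```

Hashing after normalisation matters: without it, two byte sequences that render identically would produce different hashes and slip past the constraint.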

Near-duplicates, detected by Levenshtein distance and partial (substring) matching, are flagged for human review.
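A minimal sketch of such a near-duplicate check is shown below. The Levenshtein routine is the standard dynamic-programming version; the function names and the distance threshold of 2 are assumptions, not the values used in production.

```typescript
// Classic two-row Levenshtein distance, computed over code points so
// that characters outside the BMP count as one symbol each.
function levenshtein(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hypothetical rule: flag a pair when the edit distance is small or
// when one string contains the other. The threshold is a placeholder.
function isNearDuplicate(a: string, b: string, maxDistance: number = 2): boolean {
  return a.includes(b) || b.includes(a) || levenshtein(a, b) <= maxDistance;
}
```

Pairs that pass this check would go to a reviewer rather than being rejected outright, since a one-character change can be a deliberate, distinct expression.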

In operation, about 8% of candidate additions are rejected as duplicates automatically.

3. Categorisation Algorithm

Classification is three-layered: Category (emotion: happy/sad/love/angry/cute…), Event (birthday/christmas/halloween…), and Scene (cat/morning/sakura…). A single kaomoji can carry multiple labels.

Candidate labels are drafted by AI, reviewed by an editor, and then persisted in a JSON column in SQLite.
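One way the three-layer labels could be represented and stored is sketched below. The interface, field names, table name, and query are assumptions for illustration; only the three layers and the use of a JSON column come from the text above. The query relies on SQLite's built-in `json_each` table-valued function.

```typescript
// Hypothetical shape of the value held in the JSON column.
interface KaomojiLabels {
  category: string[]; // emotion layer
  event: string[];
  scene: string[];
}

const example: KaomojiLabels = {
  category: ["happy", "cute"],
  event: ["birthday"],
  scene: ["cat"],
};

// Serialise for storage, then parse back when reading the row.
const stored: string = JSON.stringify(example);
const restored = JSON.parse(stored) as KaomojiLabels;

// True when the kaomoji carries the given label in the given layer.
function hasLabel(labels: KaomojiLabels, layer: keyof KaomojiLabels, value: string): boolean {
  return labels[layer].includes(value);
}

// Hypothetical query: find rows whose scene layer contains a value.
const bySceneSql =
  "SELECT k.id FROM kaomoji AS k, json_each(k.labels, '$.scene') AS s WHERE s.value = ?";
```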

Phase 122 (fine-grained emotion taxonomy) expanded the emotion set from 7 to 48, improving coverage of long-tail queries.

4. Ranking Algorithm

Each kaomoji has a dynamic score: score = α × copyCount + β × favoriteCount + γ × engagementSec − δ × daysSinceLastCopy.

α, β, γ, δ are tunable hyper-parameters; currently copyCount carries the heaviest weight.
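The formula translates directly into code. In the sketch below the field names follow the formula, while the interface names and the weight values in the usage note are placeholders chosen for easy arithmetic, not the production settings.

```typescript
interface KaomojiStats {
  copyCount: number;
  favoriteCount: number;
  engagementSec: number;
  daysSinceLastCopy: number;
}

interface Weights {
  alpha: number; // weight on copyCount
  beta: number;  // weight on favoriteCount
  gamma: number; // weight on engagementSec
  delta: number; // decay per day since the last copy
}

// score = alpha*copyCount + beta*favoriteCount + gamma*engagementSec - delta*daysSinceLastCopy
function score(s: KaomojiStats, w: Weights): number {
  return (
    w.alpha * s.copyCount +
    w.beta * s.favoriteCount +
    w.gamma * s.engagementSec -
    w.delta * s.daysSinceLastCopy
  );
}

// Highest score first; returns a new array and leaves the input untouched.
function rank(items: KaomojiStats[], w: Weights): KaomojiStats[] {
  return [...items].sort((a, b) => score(b, w) - score(a, w));
}
```

For example, with placeholder weights α = 4, β = 2, γ = 1, δ = 3, a kaomoji with 10 copies, 5 favourites, 20 seconds of engagement, and 2 days since its last copy scores 40 + 10 + 20 − 6 = 64.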

Scores are recomputed every 15 minutes and drive ranking pages, popular sections, and related-kaomoji suggestions.

5. Multilingual Metadata Pipeline

Each kaomoji carries usage examples, keywords, and cultural notes in all 12 languages.

Drafts come from AI translation; major languages (Japanese, English, Chinese, Korean, Spanish, Portuguese) undergo human review.

Keywords are not literal translations but natural search expressions per language. For example, Japanese "癒し" expands in English to ["healing", "soothing", "comfort", "calm"].
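The per-language keyword idea can be sketched as a map from locale to search expressions. The type names and the `matchesQuery` helper are hypothetical; the only data used is the example given above.

```typescript
type Locale = string;
type KeywordMap = Record<Locale, string[]>;

// Entries taken from the example in the text; other locales are omitted.
const keywords: KeywordMap = {
  ja: ["癒し"],
  en: ["healing", "soothing", "comfort", "calm"],
};

// True when the query equals one of the locale's keywords,
// ignoring case and surrounding whitespace.
function matchesQuery(map: KeywordMap, locale: Locale, query: string): boolean {
  const q = query.trim().toLowerCase();
  return (map[locale] ?? []).some((k) => k.toLowerCase() === q);
}
```

Because each locale has its own list, a search for "soothing" in English and "癒し" in Japanese can reach the same kaomoji without either term being a literal translation of the other.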

6. Translation QA

`scripts/audit-translations.mjs --strict` runs weekly and measures the untranslated rate per language. The target is below 5%.
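The contents of the audit script are not shown here, so the following is only a sketch of how an untranslated rate could be computed and compared against the 5% target. All names are hypothetical, and treating empty or whitespace-only strings as untranslated is an assumption.

```typescript
// One metadata row: locale code -> translated text (absent if missing).
type Translations = Record<string, string | undefined>;

const TARGET_RATE = 0.05; // the "below 5%" target from the text

// Share of rows whose text for the locale is missing or blank.
function untranslatedRate(rows: Translations[], locale: string): number {
  if (rows.length === 0) return 0;
  const missing = rows.filter((r) => (r[locale] ?? "").trim() === "").length;
  return missing / rows.length;
}

// Locales that fail the target and would need attention.
function localesOverTarget(rows: Translations[], locales: string[]): string[] {
  return locales.filter((l) => untranslatedRate(rows, l) >= TARGET_RATE);
}
```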

Adding a new language follows the `.claude/skills/adding-language` checklist covering 16 files so nothing is missed.

Translation feedback is welcome at [email protected].

7. Infrastructure & Delivery

Stack: Astro 6 SSR + React 19 Islands + SQLite (better-sqlite3) + PM2 cluster mode + Cloudflare CDN.

TTFB: roughly 25 ms (warm).

Sitemap: 415,000+ URLs (detail × 12 locales × alternate links), sharded so that each file stays within the sitemap limits Google enforces: 50,000 URLs and 50 MB uncompressed.
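Sharding by URL count can be sketched as below. The sitemap protocol caps a single file at 50,000 URLs (and 50 MB uncompressed), so the URL count is usually the limit reached first. The function name is hypothetical and the sketch does not check byte size.

```typescript
const MAX_URLS_PER_SITEMAP = 50_000; // per-file cap in the sitemap protocol

// Split the full URL list into consecutive chunks of at most `size` entries.
function shardUrls(urls: string[], size: number = MAX_URLS_PER_SITEMAP): string[][] {
  const shards: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    shards.push(urls.slice(i, i + size));
  }
  return shards;
}
```

With 415,000 URLs this yields 9 files: eight full shards of 50,000 and a final shard of 15,000. A sitemap index file would then list the shards.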

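CDN strategy: responses are cached at the Cloudflare edge with s-maxage=86400 (24 hours). A/B tests are cookie-driven, but the SSR output always renders a fixed variant, so cached pages are identical for every visitor and the CDN HIT rate is preserved.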
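One way to read this strategy is that the server sends the same cacheable HTML to everyone and the variant is resolved afterwards from the cookie. The sketch below illustrates that reading; the `public` directive, the `ab_` cookie prefix, and both function names are assumptions.

```typescript
// Cache-Control value shared by every visitor. Only s-maxage=86400 is
// stated in the text; "public" is an assumption.
function cacheControl(sMaxAgeSec: number = 86400): string {
  return `public, s-maxage=${sMaxAgeSec}`;
}

// Hypothetical client-side step: read the assigned variant from a cookie
// string, falling back to the fixed variant that the server rendered.
function readVariant(cookieHeader: string, testName: string, fallback: string): string {
  for (const part of cookieHeader.split(";")) {
    const [name, ...rest] = part.trim().split("=");
    if (name === `ab_${testName}`) return decodeURIComponent(rest.join("="));
  }
  return fallback;
}
```

Keeping the cookie out of the server-rendered output means the response does not vary per visitor, which is what lets a single cached copy serve all requests.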

8. Continuous Improvement

Every change since inception is recorded as a "Phase" — we are at Phase 554 as of 2026-04.

GA4 + GSC analysis runs 5× daily via the `/analytics-review` cron; improvements it identifies are often deployed the next day.

A/B tests are evaluated automatically by the `/ab-test` skill, which declares a winner and starts the next test.

Last updated: 2026-04-15 (Phase 554)
