News Intelligence Platform
A full-stack news intelligence system that ingests RSS sources, clusters stories, scores editorial relevance, and turns a noisy news cycle into a usable queue.
What it is
A full-stack news intelligence system built around editorial decisions. It ingests 113 RSS sources, clusters related stories, scores them for relevance, and turns a noisy set of feeds into a queue an editor can inspect.
Why I built it
As a working tech journalist, I spend a lot of time doing the same thing every morning: scanning feeds, deciding what is genuinely new, what is a repeat of yesterday, and what is worth writing about. I wanted to turn that into a system rather than a reflex.
What it does
The pipeline runs in 30-minute cycles through six stages:
- Fetch - pulls RSS from 113 configured sources.
- Extract - parses titles, body text, publication metadata, and source identity.
- Deduplicate - uses pgvector semantic similarity (1,536-dim OpenAI embeddings) to flag stories that are repeats or near-repeats of earlier coverage.
- Embed - generates fresh embeddings for new items.
- Cluster - groups related articles by semantic proximity so an editor sees the story, not ten versions of it.
- Score - applies a four-factor editorial scoring model with the breakdown stored as JSONB so the reasoning is inspectable.
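To make the deduplication stage concrete, here is a minimal sketch of a semantic near-duplicate check in plain Python. It is not the platform's code: in the real system the comparison runs inside pgvector on 1,536-dimensional embeddings, whereas this sketch uses toy 3-dimensional vectors. The function names and the 0.9 threshold are assumptions made for illustration, and the sketch assumes an embedding is already available for each item being compared.

```python
import math
from typing import Dict, Optional, Sequence

# Assumed cutoff for illustration; the write-up does not state the real one.
SIMILARITY_THRESHOLD = 0.9


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate(
    new_vec: Sequence[float],
    seen: Dict[str, Sequence[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Optional[str]:
    """Return the id of the most similar earlier item at or above the threshold, else None."""
    best_id: Optional[str] = None
    best_sim = threshold
    for item_id, vec in seen.items():
        sim = cosine_similarity(new_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = item_id, sim
    return best_id


# Toy embeddings standing in for earlier coverage (hypothetical ids).
earlier = {"story-a": [1.0, 0.0, 0.0], "story-b": [0.0, 1.0, 0.0]}

repeat_of = find_duplicate([0.99, 0.05, 0.0], earlier)   # close to story-a
fresh = find_duplicate([0.5, 0.5, 0.7], earlier)         # close to nothing
```

A linear scan like this is only workable for small sets. Pushing the comparison into the database is what lets the same idea run against the whole archive.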
The frontend is a Next.js 14 editorial dashboard backed by 20+ FastAPI endpoints. LLM-generated outputs are treated as inputs for review, with fallbacks when a primary model is unavailable.
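The fallback behaviour can be sketched as a small wrapper that tries each model in order and marks whatever comes back as needing review. This is an illustrative sketch, not the platform's implementation: `generate_with_fallback`, the result keys, and the two stand-in model functions are hypothetical names, and real model clients would replace the stubs.

```python
from typing import Callable, Dict, List, Sequence


def generate_with_fallback(
    prompt: str, models: Sequence[Callable[[str], str]]
) -> Dict[str, object]:
    """Try each model in order; return the first success, flagged for editor review."""
    errors: List[str] = []
    for index, model in enumerate(models):
        try:
            text = model(prompt)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"model {index}: {exc}")
            continue
        return {"text": text, "model_index": index, "needs_review": True, "errors": errors}
    # Every model failed: return an empty result instead of raising.
    return {"text": None, "model_index": None, "needs_review": True, "errors": errors}


# Hypothetical stand-ins for a primary and a backup model client.
def primary_model(prompt: str) -> str:
    raise RuntimeError("primary unavailable")


def backup_model(prompt: str) -> str:
    return f"summary: {prompt}"


result = generate_with_fallback("chip export rules", [primary_model, backup_model])
```

Keeping the collected errors in the result means an editor, or a log, can see that the backup model produced the text.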
The hard part
The hardest part was not the pipelines or the embeddings. It was the scoring. “Is this worth covering?” is an editorial judgement, not a classification problem. The four-factor model only works if the reasoning is visible, so the score breakdown is stored and reviewed instead of hidden behind a number.
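A rough sketch of what an inspectable score can look like follows. The four factor names and their weights are placeholders invented for this example, since the actual factors are not listed here. The point is the shape of the output: a total plus a per-factor breakdown that serialises to JSON, which is the form a JSONB column would store.

```python
import json
from typing import Dict

# Hypothetical factors and weights, chosen only to illustrate the structure.
WEIGHTS: Dict[str, float] = {
    "novelty": 0.4,
    "source_authority": 0.2,
    "audience_fit": 0.3,
    "timeliness": 0.1,
}


def score_story(factors: Dict[str, float]) -> Dict[str, object]:
    """Combine factor values in [0, 1] into a weighted total with a full breakdown."""
    if set(factors) != set(WEIGHTS):
        raise ValueError(f"expected exactly these factors: {sorted(WEIGHTS)}")
    breakdown: Dict[str, Dict[str, float]] = {}
    for name, weight in WEIGHTS.items():
        value = factors[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
        breakdown[name] = {
            "value": value,
            "weight": weight,
            "contribution": round(value * weight, 4),
        }
    total = round(sum(part["contribution"] for part in breakdown.values()), 4)
    return {"total": total, "factors": breakdown}


scored = score_story(
    {"novelty": 1.0, "source_authority": 0.5, "audience_fit": 0.0, "timeliness": 1.0}
)
payload = json.dumps(scored)  # the string a JSONB column could hold
```

Because each factor carries its own value, weight, and contribution, an editor can see which factor pushed a story up the queue and disagree with that factor specifically.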
What I learned
- Semantic deduplication catches stories that look different but are the same event, which is more useful than URL-based deduplication.
- JSONB score breakdowns matter. An editor who cannot audit why something was scored high will not trust the queue.
- A 30-minute cycle is fast enough to be useful and slow enough to be cheap. Real-time is not the goal; a clean queue is.
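The cost side of the 30-minute cycle is easy to check with arithmetic. The sketch below counts feed requests per day, assuming every cycle fetches each of the 113 sources exactly once. It covers feed requests only, not embedding or LLM calls, and the one-minute figure is a hypothetical comparison point, not something the system does.

```python
MINUTES_PER_DAY = 24 * 60


def fetches_per_day(cycle_minutes: int, source_count: int) -> int:
    """Feed requests per day when each cycle fetches every source once."""
    if cycle_minutes <= 0 or MINUTES_PER_DAY % cycle_minutes != 0:
        raise ValueError("cycle_minutes must be a positive divisor of 1440")
    cycles = MINUTES_PER_DAY // cycle_minutes
    return cycles * source_count


current = fetches_per_day(30, 113)       # 48 cycles x 113 sources = 5424
near_real_time = fetches_per_day(1, 113)  # 1440 cycles x 113 sources = 162720
```

Polling every minute would multiply the request volume by 30 for feeds that rarely change that quickly.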
Current status
MVP. Running internally alongside GhostWriter and the KeplerClaw infrastructure. Not a public product yet.