AI / News / Editorial Systems · MVP · 2026

News Intelligence Platform

A full-stack news intelligence system that ingests RSS sources, clusters stories, scores editorial relevance, and turns a noisy news cycle into a usable queue.

Python · FastAPI · PostgreSQL · pgvector · Redis · Next.js 14 · OpenAI · Docker

What it is

A full-stack news intelligence system built around editorial decisions. It ingests 113 RSS sources, clusters related stories, scores them for relevance, and turns a noisy set of feeds into a queue an editor can inspect.

Why I built it

As a working tech journalist, I spend a lot of time doing the same thing every morning: scanning feeds, deciding what is genuinely new, what is a repeat of yesterday, and what is worth writing about. I wanted to turn that into a system rather than a reflex.

What it does

The pipeline runs in 30-minute cycles through six stages:

Fetch - pulls RSS from 113 configured sources.
Extract - parses titles, body text, publication metadata, and source identity.
Deduplicate - uses pgvector semantic similarity (1,536-dim OpenAI embeddings) to flag stories that are repeats or near-repeats of earlier coverage.
Embed - generates fresh embeddings for new items.
Cluster - groups related articles by semantic proximity so an editor sees the story, not ten versions of it.
Score - applies a four-factor editorial scoring model with the breakdown stored as JSONB so the reasoning is inspectable.
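The Deduplicate stage can be illustrated with a minimal sketch. In the real system pgvector would do the comparison inside PostgreSQL (its `<=>` operator computes cosine distance); the pure-Python version below only shows the idea. The function names, the 3-dimensional toy vectors, and the 0.9 threshold are assumptions for illustration, not values from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def find_duplicate(new_vec, seen, threshold=0.9):
    """Return the id of the most similar earlier item, or None.

    `seen` maps item ids to embeddings of earlier coverage.
    `threshold` is a hypothetical cut-off; a real value needs tuning.
    """
    best_id, best_sim = None, threshold
    for item_id, vec in seen.items():
        sim = cosine_similarity(new_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = item_id, sim
    return best_id


# Toy stand-ins for 1,536-dimensional embeddings.
seen_items = {"story-a": [1.0, 0.0, 0.0], "story-b": [0.0, 1.0, 0.0]}
near_repeat = find_duplicate([0.99, 0.05, 0.0], seen_items)
unrelated = find_duplicate([0.0, 0.0, 1.0], seen_items)
```

Here `near_repeat` points at the earlier story the new item repeats, while `unrelated` is `None` and the item would move on to clustering.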

The frontend is a Next.js 14 editorial dashboard backed by 20+ FastAPI endpoints. LLM-generated outputs are treated as inputs for review, with fallbacks when a primary model is unavailable.
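One way to sketch the fallback behaviour is a loop over providers in priority order. Everything named here is hypothetical: `generate_with_fallback`, the two stand-in providers, and the shape of the returned dict are assumptions, and no real model API is called.

```python
from typing import Callable, Sequence


def generate_with_fallback(text: str, providers: Sequence[Callable[[str], str]]) -> dict:
    """Try each provider in order and return the first successful output.

    The result is always flagged for review, because generated text is
    treated as an input to an editor rather than a finished product.
    """
    errors = []
    for index, provider in enumerate(providers):
        try:
            output = provider(text)
        except Exception as exc:  # any provider failure triggers the fallback
            errors.append(f"{type(exc).__name__}: {exc}")
            continue
        return {
            "output": output,
            "provider_index": index,
            "needs_review": True,
            "errors": errors,
        }
    # Every provider failed: return an empty result the editor can still see.
    return {"output": None, "provider_index": None, "needs_review": True, "errors": errors}


def flaky_primary(text: str) -> str:
    """Stand-in for a primary model that is currently unavailable."""
    raise RuntimeError("primary model unavailable")


def simple_backup(text: str) -> str:
    """Stand-in for a cheaper fallback: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


result = generate_with_fallback(
    "Chipmaker announces new plant. Analysts expect delays.",
    [flaky_primary, simple_backup],
)
```

Keeping the collected errors in the result means an editor can see that the output came from a fallback and why.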

The hard part

The hardest part was not the pipelines or the embeddings. It was the scoring. “Is this worth covering” is an editorial judgement, not a classification problem. The four-factor model only works if the reasoning is visible, so the score breakdown is stored and reviewed instead of hidden behind a number.
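A minimal sketch of what an inspectable score could look like follows. The four factor names and their weights are invented for illustration, since the actual factors are not listed here; only the idea of storing a per-factor breakdown next to the total comes from the project.

```python
import json

# Hypothetical factors and weights; the real model's factors are not given here.
WEIGHTS = {
    "novelty": 0.4,
    "beat_fit": 0.3,
    "source_authority": 0.2,
    "cluster_momentum": 0.1,
}


def score_story(factors):
    """Combine factor values in [0, 1] into a total plus an auditable breakdown."""
    missing = set(WEIGHTS) - set(factors)
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")

    breakdown = {}
    for name, weight in WEIGHTS.items():
        value = min(max(float(factors[name]), 0.0), 1.0)  # clamp to [0, 1]
        breakdown[name] = {
            "value": value,
            "weight": weight,
            "contribution": round(value * weight, 4),
        }
    total = round(sum(part["contribution"] for part in breakdown.values()), 4)
    return {"total": total, "breakdown": breakdown}


scored = score_story(
    {"novelty": 1.0, "beat_fit": 0.5, "source_authority": 0.5, "cluster_momentum": 0.0}
)
# A plain JSON document like this is what would be written to a JSONB column.
payload = json.dumps(scored, sort_keys=True)
```

Because each factor's value, weight, and contribution travel with the total, an editor can see which factor pushed a story up the queue and disagree with it.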

What I learned

Semantic deduplication catches stories that look different but are the same event, which is more useful than URL-based deduplication.
JSONB score breakdowns matter. An editor who cannot audit why something was scored high will not trust the queue.
A 30-minute cycle is fast enough to be useful and slow enough to be cheap. Real-time is not the goal; a clean queue is.

Current status

MVP. Running internally alongside GhostWriter and the KeplerClaw infrastructure. Not a public product yet.
