Changelog
Every PR merged by human or robot. Full transparency on what changed and why.
30 entries / generated May 8, 2026
Added `generate_day_summary()` function (mirroring `generate_week_theme`) that uses Haiku to generate JSON with `dayTheme` and `daySummary` per digest. The digest loop detects the latest day and regenerates both fields on each export; prior days reuse cached values from the previous JSON.
The weekly digest hero needs a headline that reflects the current day's top-story selection and refreshes hourly, but the existing week-scoped `weekTheme`/`weekSummary` are cached and don't change between exports.
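A minimal sketch of the regeneration rule described above, assuming a cache keyed by ISO date (the helper name and cache shape are illustrative; only `generate_day_summary()` comes from the PR):

```python
def day_fields_for_digest(digest_date, latest_date, cached_days: dict) -> dict:
    # Only the latest day is regenerated on each export; earlier days reuse
    # the dayTheme/daySummary values carried over from the previous JSON export.
    if digest_date == latest_date:
        return generate_day_summary(digest_date)  # Haiku-backed, per the PR
    return cached_days.get(digest_date.isoformat(), {})
```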
Added hashtag derivation logic that queries prediction-linked entities and top global entities (10+ mentions), converts them to cased hashtags (e.g., #OpenAI), and passes the candidates to the tweet prompt so Haiku selects the 2-4 most relevant ones.
Tweets lack relevant, consistent hashtags. By leveraging the entity glossary (already extracted during scoring), the system can automatically suggest contextually appropriate hashtags while preserving brand identity.
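The casing step might look roughly like this (the helper name and casing rules are assumptions; the PR only specifies the #OpenAI-style output):

```python
import re

def to_hashtag(entity_name: str) -> str:
    # Keep existing internal capitalization ("OpenAI" -> "#OpenAI"),
    # capitalize all-lowercase words ("anthropic" -> "#Anthropic"),
    # and drop anything that is not alphanumeric.
    words = re.findall(r"[A-Za-z0-9]+", entity_name)
    cased = [w if any(c.isupper() for c in w) else w.capitalize() for w in words]
    return "#" + "".join(cased)

# to_hashtag("Open AI") -> "#OpenAI", to_hashtag("hugging face") -> "#HuggingFace"
```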
Added a LEFT JOIN with `prediction_metrics_snapshot` to include topic velocity and source metadata, populated `signals[]` with evidence stories via a separate query, and exposed previously unexported columns (evaluatedAt, targetDate, evalHorizon, tweet, tweetReply, postedToXAt), all additive with no breaking changes.
The predictions export was missing contextual data (topic velocity, source signals, evaluation timestamps) that would help the frontend and downstream consumers understand the grounding and temporal context of each prediction.
Added `get_tag_distribution()` (top 20 tags from last 30 days) and `get_tier_distribution()` (notable/essential split from last 7 days) functions to `push_stats.py`, and fixed `get_recent_scored_stories()` to include all digested stories and sort by `scored_at` instead of `fetched_at`. Hoisted `stories_per_hour_24h` and `backlog_trend_6h` to the top-level stats dict so the oracle-pi API receives them correctly.
The oracle-pi site needed insights into tag distribution, tier split, and recent scoring activity to populate the Data Quality dashboard. The `push_stats` pipeline was missing these metrics and had a bug where recent stories were incorrectly filtered and sorted, limiting visibility into what the system was scoring.
fetch.py now enforces a `daily_cap` config key per RSS source, counting stories fetched today and skipping the source once the cap is reached. manage_sources.py validates `daily_cap` and `max_per_fetch` as positive integers and adds a `set-config` subcommand for updating source configuration from the CLI.
Noisy RSS sources can overwhelm the pipeline by fetching excessive stories per day. This change enables per-source rate limiting to control ingestion volume and prevent resource waste.
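A sketch of the cap check, assuming a `stories` table with `source_id` and `fetched_at` columns (the table and column names are assumptions, not confirmed by the PR):

```python
import sqlite3

def stories_fetched_today(conn: sqlite3.Connection, source_id: str) -> int:
    # Count stories already ingested from this source since midnight (SQLite's date()).
    row = conn.execute(
        "SELECT COUNT(*) FROM stories "
        "WHERE source_id = ? AND date(fetched_at) = date('now')",
        (source_id,),
    ).fetchone()
    return row[0]

def source_capped(conn: sqlite3.Connection, source: dict) -> bool:
    # Sources without a daily_cap key are never skipped.
    cap = source.get("config", {}).get("daily_cap")
    return cap is not None and stories_fetched_today(conn, source["id"]) >= int(cap)
```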
- Adds `subjectCount`, `sourceCount`, `recentMentionCount`, `peakRelevanceScore` via a single batch aggregation JOIN (no N+1)
- Adds `mentionsByDay` to the glossary JSON export — 30-day daily entity mention counts from `entity_mentions` + `stories` join
Added normalization logic that converts common relevance score strings to their numeric equivalents, while preserving validation failure behavior for unrecognized values so they retry through existing error handling.
Claude's relevance score field was sometimes returning string values ("high"/"medium"/"low") instead of numeric scores, causing validation failures and blocking the pipeline.
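A sketch of the normalization, with a hypothetical word-to-number mapping (the actual numeric equivalents in the repo may differ):

```python
# Hypothetical mapping; the repo's actual equivalents are not shown in the PR.
_SCORE_WORDS = {"low": 2, "medium": 3, "high": 4}

def normalize_relevance_score(value):
    # Numeric scores pass through untouched.
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        word = value.strip().lower()
        if word in _SCORE_WORDS:
            return _SCORE_WORDS[word]
    # Unrecognized values are returned as-is so validation still fails
    # and the story retries through the existing error handling.
    return value
```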
Switched delivery counting from story status to the digest ledger, updated "processed today" logic to check fetch/digest timestamps instead, and added regression tests for cases where story status has drifted.
Story status fields can drift away from actual delivery state, leading to inaccurate delivery counts. The digest ledger provides authoritative tracking of what was actually delivered.
Updated the `score_one` function's normalization logic to unwrap nested tuple wrappers from `score_story` outputs, and added regression test coverage for both direct and nested tuple-shaped outputs.
Score outputs were being wrapped in nested tuples, causing the normalization logic to fail during story scoring. This prevented the pipeline from handling certain score output formats.
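The unwrapping itself is simple; a sketch assuming single-element wrappers:

```python
def unwrap_score(result):
    # score_story() should return a dict; some paths hand it back as (dict,)
    # or even ((dict,),). Peel single-element tuple/list wrappers until none remain.
    while isinstance(result, (tuple, list)) and len(result) == 1:
        result = result[0]
    return result
```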
Switched `evaluated_at` and `expiry` fields to timezone-aware UTC datetimes, updated prediction context-building to include same-day stories, and added regression tests for prediction evaluation and queue ordering.
Timezone-naive timestamps were causing incorrect prediction evaluation and expiry logic, leading to off-by-one errors in story context and missed predictions (issue #166).
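For reference, the timezone-aware form (the horizon length below is illustrative, not from the PR):

```python
from datetime import datetime, timedelta, timezone

# Aware UTC timestamps compare correctly across the pipeline; naive
# datetime.utcnow() values were the source of the off-by-one behavior.
evaluated_at = datetime.now(timezone.utc)
expiry = evaluated_at + timedelta(days=7)  # illustrative horizon
```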
Added confidence levels (high/medium/moonshot) to past predictions shown to Opus and replaced flat deduplication logic with confidence-aware guidance—high-confidence predictions can be revisited and refined, while medium/moonshot and resolved predictions maintain the existing dedup rule.
Prior predictions were deduplicated uniformly, preventing follow-up insights on high-confidence topics even when new signals emerged. The change enables the deep predictions engine to reinforce or challenge high-confidence prior predictions based on fresh data.
Coerced missing score fields before validation to handle malformed data gracefully, tightened entity backfill prompts for better extraction accuracy, and updated deep predictions learnings based on operational feedback.
Stories with missing score fields were causing validation failures (#122), and entity backfill prompts needed to be more precise to handle unknown entities correctly.
The fix coerces missing score fields to default values before validation runs, preventing validation errors. Deep predictions skill docs were updated to clarify usage and implementation details.
Article validation was failing when score fields were missing (issue #122), and the deep predictions skill documentation needed updating to reflect current learnings and best practices.
Added field coercion with defaults before validation runs, so stories with partial Claude responses now score successfully as long as the essential fields are populated.
Claude responses for story scoring sometimes omitted optional fields (summary_long, summary_short, entities), causing validation to fail even when core scoring data was present.
Refactored the dry-run path to be purely read-only, eliminating Claude calls and DB writes during preview, and added regression tests to prevent future violations.
Dry-run mode was making Claude API calls and writing to the database despite being intended as a preview-only operation, wasting API quota and creating unintended side effects.
Hardened `process_one` filter handling to normalize tuple-wrapped outputs, matching the existing score-path normalization, and added a regression test to verify stories advance cleanly through tuple-wrapped filter results.
The `process_one` filter path was not handling tuple-wrapped return values, while the score path already had normalization for this case. This inconsistency caused stories to fail advancement when filters returned tuples (issue #132).
Added type coercion in the validation pipeline to normalize digit-string scores to integers before validation; included a regression test covering the batch scoring path and updated the improvement log.
Claude's JSON output sometimes wraps `relevance_score` as a digit-string (e.g., `"4"` instead of `4`), causing valid scores to fail validation and block story ingestion.
Added retry logic (up to 3 attempts) for Hacker News top-stories and item JSON requests to tolerate temporary connection failures.
Hacker News API requests were failing due to transient network issues, causing unnecessary fetch errors and reducing pipeline reliability.
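A sketch of the retry wrapper (the function name, timeout, and backoff are illustrative):

```python
import time
import requests

def fetch_json_with_retry(url: str, attempts: int = 3, backoff: float = 2.0):
    # Retry transient connection failures up to `attempts` times before giving up.
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)

# Usage:
# top_ids = fetch_json_with_retry("https://hacker-news.firebaseio.com/v0/topstories.json")
```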
Replaced `export_site.py`'s local connection wrapper with the shared DB connection helper from `utils/`, added regression tests for the shared connection/migration behavior, and updated tests to match the current JSON and predictions schema.
Eliminated duplicate database connection management in `export_site.py` to enforce the "one writer" architectural pattern documented in the repo doctrine, ensuring consistent connection behavior and migration handling across the codebase.
Added `get_recent_scored_stories()` to `push_stats.py` that queries the last 10 scored/digested stories and writes them to Redis (`oracle:stats:recent_stories`) on every push_stats run (every minute), including title, URL, score, timestamp, and source metadata.
The status page on oracle-pi needs access to recent scored stories to display live data without querying SQLite directly.
Expanded `push_stats.py` to include additional pipeline/throughput/scoring/system fields, increased hourly history retention from 48 to 720 entries (~30 days), and added a new 90-day daily summary key; added backfill logic to `post_to_x.py` to generate missing tweet replies via Haiku for predictions that have tweets.
Extend observability of the pipeline by capturing more granular metrics over longer retention periods, and ensure all predictions have corresponding tweet replies.
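The retention bump amounts to trimming the Redis list at a larger bound; a sketch with redis-py (the key name is illustrative):

```python
import json
import redis

r = redis.Redis()

def push_hourly_snapshot(stats: dict, max_entries: int = 720):
    # Prepend the newest snapshot and keep ~30 days of hourly entries.
    key = "oracle:stats:hourly_history"  # illustrative key name
    r.lpush(key, json.dumps(stats))
    r.ltrim(key, 0, max_entries - 1)
```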
Added three-tier caching for haikus, week themes, and PR summaries: the exporter checks the SQLite DB first, then previously exported JSON files, and only generates via the Haiku API for genuinely new dates and content.
Re-exports to oracle-pi were making ~37 unnecessary Haiku API calls per run when content hadn't changed, wasting quota and slowing the pipeline.
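A sketch of the lookup order for one of the cached artifacts, assuming a simple cache table (the table, column names, and signature of `generate_week_theme()` are assumptions):

```python
import json
import sqlite3

def get_or_generate_week_theme(conn: sqlite3.Connection, week_key: str, exported_json: dict) -> dict:
    # 1) SQLite cache, 2) previously exported JSON, 3) Haiku only for genuinely new weeks.
    row = conn.execute(
        "SELECT payload FROM week_theme_cache WHERE week = ?", (week_key,)
    ).fetchone()
    if row:
        return json.loads(row[0])
    if week_key in exported_json:
        return exported_json[week_key]
    payload = generate_week_theme(week_key)  # existing Haiku-backed generator
    conn.execute(
        "INSERT INTO week_theme_cache (week, payload) VALUES (?, ?)",
        (week_key, json.dumps(payload)),
    )
    conn.commit()
    return payload
```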
Replaced `escape_ts_string`/`render_ts_file` with `json.dumps`, emitted a `manifest.json` instead of `index.ts` dynamic imports, and converted all downstream data files (glossary, trends, changelog, predictions) to JSON, with support for historical re-exports via a `--week` flag.
Hand-built TypeScript string rendering was complex and fragile; switching to JSON simplifies the data export pipeline and eliminates 311 lines of string-building code.
Added `tweet_reply` column to predictions table to store signal reasoning; `post_to_x.py` now posts prediction as main tweet + self-reply thread explaining the supporting data points, with reply failure not blocking the main tweet.
Predictions were using overly declarative language ("bet", "will") that suggested false certainty; shifting to analytical framing ("signal suggests", "watching for") grounds predictions in data and makes them more credible.
Added validation to explicitly check that structured output is a dict before processing; non-dict payloads are now treated as scoring failures and blocked. Includes a regression test for malformed payloads.
Claude's structured output could return malformed payloads that weren't caught, allowing broken data to flow downstream instead of failing at the source.
Added normalization to unwrap tuple/list-wrapped scores before validation, backfilled missing optional score fields for legacy compatibility, added regression tests for tuple-wrapped results, and implemented a sqlite_vec compatibility fallback for entity embedding deserialization.
process_one was failing when score outputs came wrapped in tuples/lists, and legacy outputs lacked optional score fields. Additionally, entity embedding deserialization was breaking on older sqlite_vec builds.
Implemented semantic embedding-based deduplication using all-MiniLM-L6-v2 (384-dim) embeddings with conservative thresholds: entities with similarity >0.90 are auto-merged, and predictions flagged as near-dupes within 7 days are excluded from exports.
Entities and predictions were being duplicated in the system due to naming variants (e.g., 'OpenAI' vs 'Open AI') that simple string matching couldn't catch, leading to redundant data in the knowledge base and exports.
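A minimal sketch of the similarity check behind the entity auto-merge (the helper names are illustrative; the model and 0.90 threshold come from the PR):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_merge(entity_a: str, entity_b: str, threshold: float = 0.90) -> bool:
    # Auto-merge naming variants like "OpenAI" vs "Open AI" when their embeddings
    # are nearly identical; below the threshold, leave the entities separate.
    emb_a, emb_b = model.encode([entity_a, entity_b])
    return cosine(emb_a, emb_b) > threshold
```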
Added `compute_glossary_edges()` to extract co-occurrence patterns between entities and export them as weighted edges (source, target, weight) in `glossary.ts`. Also improved entity extraction prompts to emphasize canonical naming and widen extraction range from 3–8 to 2–10 entities.
The oracle-pi frontend needs relationship data to visualize entities as an interactive graph, showing how glossary entities co-occur and relate to each other.
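The aggregation behind the weighted edges might look roughly like this (a sketch; the real `compute_glossary_edges()` presumably reads mentions from the database rather than an in-memory dict):

```python
from collections import Counter
from itertools import combinations

def compute_cooccurrence_edges(story_entities):
    # story_entities maps story id -> list of canonical entity names in that story.
    counts = Counter()
    for entities in story_entities.values():
        # Each unordered pair of entities appearing in the same story gains +1 weight.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return [
        {"source": a, "target": b, "weight": w}
        for (a, b), w in counts.most_common()
    ]
```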
Extracted a `validate_score_output()` helper in `score.py` that enforces all required fields and type constraints (`relevance_score` must be int 1–5), then applied it consistently in both `score_all()` and `score_one()`. Added comprehensive test suite covering invalid types, out-of-range values, and missing fields.
Score output validation was fragmented across the pipeline with inconsistent checks, allowing invalid Claude responses (wrong types, missing fields) to slip through or cause failures downstream. This PR centralizes validation to catch all issues upfront.
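A sketch of what the centralized check enforces (the required-field list is illustrative; the PR only pins down the `relevance_score` constraint):

```python
def validate_score_output(score) -> dict:
    # Non-dict payloads are rejected outright (see the malformed-payload entry above).
    if not isinstance(score, dict):
        raise ValueError(f"score output must be a dict, got {type(score).__name__}")
    # Illustrative field list; the repo's actual required set may differ.
    for field in ("relevance_score", "summary_short", "tags"):
        if field not in score:
            raise ValueError(f"missing required field: {field}")
    rs = score["relevance_score"]
    if not isinstance(rs, int) or not 1 <= rs <= 5:
        raise ValueError(f"relevance_score must be an int in 1-5, got {rs!r}")
    return score
```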