fix: prevent false positives in OpenAI web report from lastmod churn by rollysys · Pull Request #181 · duanyytop/agents-radar

rollysys · 2026-03-15T06:13:37Z

First of all, thank you for building agents-radar! It's an incredibly well-architected project — the pipeline design, the bilingual support, and the sitemap-based web tracking are all excellent. I've learned a lot from the codebase. 🙏

Problem

The OpenAI web report consistently produces unreliable content because of two compounding issues:

1. Sitemap `lastmod` reflects generation time, not publication time

OpenAI's sitemap regenerates lastmod timestamps on every crawl cycle. Since metadataOnly is true for OpenAI (Cloudflare blocks page fetches), the system can't verify actual content changes. Result: every run flags 50+ unchanged URLs as "new".

2. LLM over-interprets URL slugs

With metadataOnly mode, titles are derived from URL path segments (e.g., introducing-gpt-5-2 → "Introducing Gpt 5 2"). The prompt labels missing content as "Unable to extract text content", but the LLM still tries to analyze these slug-derived titles as if they were real headlines, producing speculative and potentially misleading analysis.
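To illustrate why slug-derived titles carry so little information, here is a minimal sketch of how such a title might be produced. The function name `titleFromSlug` and its exact rules are assumptions for illustration; the project's actual helper may differ.

```typescript
// Hypothetical helper (not taken from the repository): build a display title
// from the last non-empty path segment of a URL or path.
function titleFromSlug(urlOrPath: string): string {
  const segments = urlOrPath.split("/").filter((s) => s.length > 0);
  const slug = segments.pop() ?? "";
  return slug
    .split("-")
    .filter((word) => word.length > 0)
    // Capitalize the first character of each word; digits are left unchanged.
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}

// The version number "5.2" is flattened to "5 2" because the slug only has hyphens.
console.log(titleFromSlug("https://example.com/index/introducing-gpt-5-2/"));
// prints "Introducing Gpt 5 2"
```

The output contains nothing beyond the words in the URL itself, so any analysis built on it is guesswork.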

Example from actual output:

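Note: OpenAI's crawl list shows a large number of URLs, but extraction of the main text content failed. However, the wording of these URLs (e.g., Introducing Gpt 5 2, Gpt 5 1 Codex Max) is in itself an extremely strong strategic signal.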

Fix

src/web.ts — For metadataOnly sites, skip lastmod-based change detection. Only truly never-seen URLs are treated as new:

// Before: triggers on lastmod changes (false positives for OpenAI)
if (lastmod && lastmod > prev) return true;

// After: only for sites where we can verify content changed
if (!cfg.metadataOnly && lastmod && lastmod > prev) return true;
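For context, the changed condition could sit in a predicate like the one sketched below. The names `SiteConfig`, `SitemapEntry`, `isNewOrUpdated`, and the `seen` map are assumptions made for this example, not the actual definitions in src/web.ts. The sketch also assumes `lastmod` values are ISO 8601 strings in a uniform format, so that string comparison orders them chronologically.

```typescript
// Hypothetical shapes; the real types in src/web.ts may differ.
interface SiteConfig {
  name: string;
  metadataOnly: boolean; // true when page bodies cannot be fetched
}

interface SitemapEntry {
  url: string;
  lastmod?: string; // ISO 8601 timestamp from the sitemap, if present
}

// `seen` maps each tracked URL to the lastmod recorded on the previous run
// (an empty string when the sitemap supplied none).
function isNewOrUpdated(
  entry: SitemapEntry,
  seen: Map<string, string>,
  cfg: SiteConfig,
): boolean {
  const prev = seen.get(entry.url);
  // A URL that was never tracked is always reported, for every site type.
  if (prev === undefined) return true;
  // lastmod is trusted only for sites whose page content can be fetched.
  if (!cfg.metadataOnly && entry.lastmod && entry.lastmod > prev) return true;
  return false;
}
```

With this shape, a tracked URL whose `lastmod` moved forward is reported for a site such as Anthropic but ignored for a metadata-only site such as OpenAI.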

src/prompts-data.ts — Two improvements:

Replace the generic "Unable to extract text content" label with an explicit metadata-only explanation (in both EN and ZH)
Add a clear instruction telling the LLM not to speculate on URL slug meanings or fabricate summaries
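The two prompt changes might look roughly like the sketch below. The wording of the strings and the names `METADATA_ONLY_NOTE`, `NO_SPECULATION_RULE`, and `formatItem` are assumptions for illustration; the actual text in src/prompts-data.ts is not reproduced in this description.

```typescript
type Lang = "en" | "zh";

interface WebItem {
  title: string;
  url: string;
  content?: string; // absent for metadata-only sites
}

// Hypothetical label that replaces the generic "Unable to extract text content".
const METADATA_ONLY_NOTE: Record<Lang, string> = {
  en: "[metadata only: title derived from URL slug, no page content available]",
  zh: "[仅元数据：标题由 URL 路径生成，无法获取正文内容]",
};

// Hypothetical directive appended to the prompt in each language.
const NO_SPECULATION_RULE: Record<Lang, string> = {
  en: "Do not infer meaning from URL slugs or invent summaries for metadata-only items; state that content is unavailable.",
  zh: "请勿根据 URL 路径推测含义，也不要为仅含元数据的条目编造摘要；请明确说明内容不可用。",
};

// Render one item, falling back to the explicit label when no content exists.
function formatItem(item: WebItem, lang: Lang): string {
  const text = item.content?.trim() ?? "";
  const body = text.length > 0 ? text : METADATA_ONLY_NOTE[lang];
  return `- ${item.title} (${item.url})\n  ${body}`;
}
```

Keeping the label and the directive as separate per-language constants makes it easy to keep the EN and ZH prompts in step.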

Result

Metric	Before	After
"New" OpenAI URLs per run	50+ (false positives)	0 (only genuinely new)
LLM behavior	Speculates on URL slugs	Reports data limitation clearly

Test plan

TypeScript compiles cleanly
Verified locally: incremental run detects 0 false-positive new URLs for OpenAI
Anthropic (non-metadataOnly) continues to use lastmod-based detection as before

🤖 Generated with Claude Code

Problem: OpenAI sitemap lastmod values reflect sitemap generation time, not article publication time. Every run, hundreds of unchanged URLs get flagged as "new" because their lastmod changed, producing a web report full of URL-slug-derived titles with no actual content, which the LLM then over-interprets.

Fix:
1. For metadataOnly sites, only treat truly never-seen URLs as new (ignore lastmod updates on already-tracked URLs)
2. Clearly label metadata-only items in the prompt so the LLM knows titles are URL-slug-derived and no content is available
3. Add explicit instruction in both EN/ZH prompts telling the LLM not to speculate on URL slug meanings or fabricate content summaries

Before: every run detected 50+ "new" OpenAI URLs → speculative analysis
After: only genuinely new URLs appear → factual reporting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

duanyytop · 2026-03-15T12:48:16Z

Thanks for the PR, @rollysys! This is a well-targeted fix for a real issue — the lastmod churn on OpenAI's sitemap was indeed causing 50+ false positives per run.

The approach looks solid:

Skipping lastmod comparison for metadataOnly sites is the right call since the timestamps reflect sitemap generation time, not actual content changes.
The prompt improvements (metadata-only annotation + explicit "do not speculate" directive) should make the LLM output much more reliable.

One minor suggestion for future consideration: the metadata-only warning in the prompt could be driven dynamically by the metadataOnly config flag rather than being hardcoded for OpenAI, so it automatically applies if another site ever needs the same treatment. But that's not a blocker.
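A minimal sketch of that suggestion, assuming a per-site config object with a `metadataOnly` flag. The names `SiteConfig` and `metadataOnlyWarning` and the warning text are hypothetical, not code from the repository.

```typescript
// Hypothetical config shape; only the metadataOnly flag is taken from the PR.
interface SiteConfig {
  name: string;
  metadataOnly: boolean;
}

// Build the prompt warning from config instead of hardcoding a site name.
// Returns an empty string when no site is metadata-only.
function metadataOnlyWarning(sites: SiteConfig[]): string {
  const names = sites.filter((s) => s.metadataOnly).map((s) => s.name);
  if (names.length === 0) return "";
  return (
    `Items from ${names.join(", ")} are metadata-only: ` +
    "titles are derived from URL slugs and no page content is available. " +
    "Do not speculate on slug meanings or fabricate summaries."
  );
}
```

Any site later marked `metadataOnly` would then pick up the warning without a prompt edit.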

LGTM, will merge. Thanks again!

duanyytop merged commit fde93e9 into duanyytop:master Mar 15, 2026
1 check passed

duanyytop merged 1 commit into duanyytop:master from rollysys:fix/openai-web-lastmod-churn
