
fix: prevent false positives in OpenAI web report from lastmod churn #181

Merged
duanyytop merged 1 commit into duanyytop:master from rollysys:fix/openai-web-lastmod-churn
Mar 15, 2026

@rollysys
Contributor

First of all, thank you for building agents-radar! It's an incredibly well-architected project — the pipeline design, the bilingual support, and the sitemap-based web tracking are all excellent. I've learned a lot from the codebase. 🙏

Problem

The OpenAI web report consistently produces unreliable content because of two compounding issues:

1. Sitemap lastmod reflects generation time, not publication time

OpenAI's sitemap regenerates lastmod timestamps on every crawl cycle. Since metadataOnly is true for OpenAI (Cloudflare blocks page fetches), the system can't verify actual content changes. Result: every run flags 50+ unchanged URLs as "new".

2. LLM over-interprets URL slugs

With metadataOnly mode, titles are derived from URL path segments (e.g., introducing-gpt-5-2 → "Introducing Gpt 5 2"). The prompt labels missing content as "Unable to extract text content", but the LLM still tries to analyze these slug-derived titles as if they were real headlines, producing speculative and potentially misleading analysis.

Example from actual output:

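Please note: OpenAI's crawl list shows a large number of URLs, but extraction of the main text content failed. However, the wording of these URLs (e.g., Introducing Gpt 5 2, Gpt 5 1 Codex Max) is in itself an extremely strong strategic signal.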
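For context, titles such as "Introducing Gpt 5 2" carry no information beyond the URL itself. The sketch below shows one way a slug could be turned into such a title. The helper name `slugToTitle` and the example path are hypothetical and are not taken from the agents-radar codebase; the actual derivation in `src/web.ts` may differ.

```typescript
// Hypothetical helper (not from the PR): turn the last path segment of a URL
// into a title-cased string, the way a metadata-only tracker might.
function slugToTitle(urlOrPath: string): string {
  // Drop any query string or fragment before splitting the path.
  const path = urlOrPath.split(/[?#]/)[0] ?? "";
  const segments = path.split("/").filter((s) => s.length > 0);
  const slug = segments[segments.length - 1] ?? "";
  return slug
    .split("-")
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}

// The slug from the PR description; the result is purely mechanical.
const exampleTitle = slugToTitle("introducing-gpt-5-2"); // "Introducing Gpt 5 2"
```

Because the output is a mechanical transform of the slug, any "analysis" of it is really analysis of the URL, which is why the prompt change asks the LLM not to read meaning into it.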

Fix

src/web.ts — For metadataOnly sites, skip lastmod-based change detection. Only truly never-seen URLs are treated as new:

// Before: triggers on lastmod changes (false positives for OpenAI)
if (lastmod && lastmod > prev) return true;

// After: only for sites where we can verify content changed
if (!cfg.metadataOnly && lastmod && lastmod > prev) return true;
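To show the one-line change in context, here is a minimal, self-contained sketch of the detection logic. The `SiteConfig` and `SeenState` types, the function name `isNewOrUpdated`, and the choice to store `lastmod` as an ISO 8601 string are all assumptions made for illustration; only the guarded comparison mirrors the diff above.

```typescript
// Assumed shapes for illustration; the real types in src/web.ts may differ.
interface SiteConfig {
  metadataOnly: boolean;
}

// Maps a URL to the lastmod value recorded on a previous run.
// Assumes ISO 8601 UTC strings, which compare correctly as plain strings.
type SeenState = Record<string, string>;

function isNewOrUpdated(
  cfg: SiteConfig,
  seen: SeenState,
  url: string,
  lastmod?: string
): boolean {
  const prev = seen[url];
  // A URL that was never recorded is always new, for every site.
  if (prev === undefined) return true;
  // Trust lastmod only where page content can be fetched and verified.
  if (!cfg.metadataOnly && lastmod && lastmod > prev) return true;
  return false;
}
```

With this guard, a metadata-only site whose sitemap rewrites `lastmod` on every run no longer reports already-tracked URLs, while sites with fetchable content keep the existing behavior.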

src/prompts-data.ts — Two improvements:

  1. Replace generic "Unable to extract text content" with explicit metadata-only explanation (in both EN/ZH)
  2. Add clear instruction to the LLM not to speculate on URL slug meanings or fabricate summaries

Result

| Metric | Before | After |
| --- | --- | --- |
| "New" OpenAI URLs per run | 50+ (false positives) | 0 (only genuinely new) |
| LLM behavior | Speculates on URL slugs | Reports data limitation clearly |

Test plan

  • TypeScript compiles cleanly
  • Verified locally: incremental run detects 0 false-positive new URLs for OpenAI
  • Anthropic (non-metadataOnly) continues to use lastmod-based detection as before

🤖 Generated with Claude Code

Problem:
OpenAI sitemap lastmod values reflect sitemap generation time, not article
publication time. Every run, hundreds of unchanged URLs get flagged as "new"
because their lastmod changed, producing a web report full of URL-slug-derived
titles with no actual content — which the LLM then over-interprets.

Fix:
1. For metadataOnly sites, only treat truly never-seen URLs as new (ignore
   lastmod updates on already-tracked URLs)
2. Clearly label metadata-only items in the prompt so the LLM knows titles
   are URL-slug-derived and no content is available
3. Add explicit instruction in both EN/ZH prompts telling the LLM not to
   speculate on URL slug meanings or fabricate content summaries

Before: every run detected 50+ "new" OpenAI URLs → speculative analysis
After: only genuinely new URLs appear → factual reporting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@duanyytop
Owner

duanyytop commented Mar 15, 2026

Thanks for the PR, @rollysys! This is a well-targeted fix for a real issue — the lastmod churn on OpenAI's sitemap was indeed causing 50+ false positives per run.

The approach looks solid:

  • Skipping lastmod comparison for metadataOnly sites is the right call since the timestamps reflect sitemap generation time, not actual content changes.
  • The prompt improvements (metadata-only annotation + explicit "do not speculate" directive) should make the LLM output much more reliable.

One minor suggestion for future consideration: the metadata-only warning in the prompt could be driven dynamically by the metadataOnly config flag rather than being hardcoded for OpenAI, so it automatically applies if another site ever needs the same treatment. But that's not a blocker.
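A rough sketch of what that suggestion could look like follows. Everything here is hypothetical: the `SiteConfig` and `WebItem` shapes, the `formatItem` function, and the wording of `METADATA_ONLY_NOTE` are illustrative and are not the strings or types used in `src/prompts-data.ts`.

```typescript
// Assumed shapes for illustration only.
interface SiteConfig {
  name: string;
  metadataOnly: boolean;
}

interface WebItem {
  title: string;
  url: string;
  content?: string;
}

// Hypothetical wording; the merged prompt text may read differently.
const METADATA_ONLY_NOTE =
  "Metadata only: the title is derived from the URL slug and no page content " +
  "is available. Do not speculate on its meaning or fabricate a summary.";

// Builds the prompt entry for one item. The warning is keyed off the config
// flag, so any site marked metadataOnly gets it without site-specific code.
function formatItem(cfg: SiteConfig, item: WebItem): string {
  const lines = ["- " + item.title + " (" + item.url + ")"];
  if (cfg.metadataOnly) {
    lines.push("  [" + METADATA_ONLY_NOTE + "]");
  } else if (item.content) {
    lines.push("  " + item.content);
  }
  return lines.join("\n");
}
```

Keying the note off `cfg.metadataOnly` rather than the site name means a second metadata-only site would only need a config change.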

LGTM, will merge. Thanks again!

@duanyytop duanyytop merged commit fde93e9 into duanyytop:master Mar 15, 2026
1 check passed