fix: prevent false positives in OpenAI web report from lastmod churn (#181)
Merged

duanyytop merged 1 commit into duanyytop:master on Mar 15, 2026
Conversation
Problem: OpenAI sitemap `lastmod` values reflect sitemap generation time, not article publication time. Every run, hundreds of unchanged URLs get flagged as "new" because their `lastmod` changed, producing a web report full of URL-slug-derived titles with no actual content, which the LLM then over-interprets.

Fix:
1. For `metadataOnly` sites, only treat truly never-seen URLs as new (ignore `lastmod` updates on already-tracked URLs)
2. Clearly label metadata-only items in the prompt so the LLM knows titles are URL-slug-derived and no content is available
3. Add an explicit instruction in both EN/ZH prompts telling the LLM not to speculate on URL slug meanings or fabricate content summaries

Before: every run detected 50+ "new" OpenAI URLs → speculative analysis
After: only genuinely new URLs appear → factual reporting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner
Thanks for the PR, @rollysys! This is a well-targeted fix for a real issue; the lastmod churn on OpenAI's sitemap was indeed causing 50+ false positives per run. The approach looks solid.

One minor suggestion for future consideration: the metadata-only warning in the prompt could be driven dynamically by the `metadataOnly` config flag rather than being hardcoded for OpenAI, so it automatically applies if another site ever needs the same treatment. But that's not a blocker. LGTM, will merge. Thanks again!
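The config-driven approach suggested above could look roughly like the sketch below. This is illustrative only: `SiteConfig` and `metadataOnlyNote` are assumed names, not identifiers from the repository, and the warning strings are placeholders.

```typescript
// Hypothetical site config shape; the real one in the repo may differ.
interface SiteConfig {
  name: string;
  metadataOnly?: boolean;
}

// Build the metadata-only warning from the config flag instead of
// hardcoding it for OpenAI, so any metadataOnly site gets the same note.
function metadataOnlyNote(site: SiteConfig, lang: "en" | "zh"): string {
  if (!site.metadataOnly) return "";
  return lang === "en"
    ? `[${site.name}] Titles are URL-slug-derived; no page content is available. Do not speculate.`
    : `[${site.name}] 标题由 URL slug 推导，无页面正文，请勿臆测内容。`;
}

console.log(metadataOnlyNote({ name: "OpenAI", metadataOnly: true }, "en"));
```

With this shape, adding another Cloudflare-blocked site only requires flipping `metadataOnly: true` in its config; no prompt changes are needed.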
First of all, thank you for building agents-radar! It's an incredibly well-architected project — the pipeline design, the bilingual support, and the sitemap-based web tracking are all excellent. I've learned a lot from the codebase. 🙏
Problem
The OpenAI web report consistently produces unreliable content because of two compounding issues:
**1. Sitemap `lastmod` reflects generation time, not publication time**

OpenAI's sitemap regenerates `lastmod` timestamps on every crawl cycle. Since `metadataOnly` is `true` for OpenAI (Cloudflare blocks page fetches), the system can't verify actual content changes. Result: every run flags 50+ unchanged URLs as "new".

**2. LLM over-interprets URL slugs**

With `metadataOnly` mode, titles are derived from URL path segments (e.g., `introducing-gpt-5-2` → "Introducing Gpt 5 2"). The prompt labels missing content as "Unable to extract text content", but the LLM still tries to analyze these slug-derived titles as if they were real headlines, producing speculative and potentially misleading analysis.

Example from actual output:
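For concreteness, the slug-to-title derivation described above can be sketched as follows. The function name `titleFromSlug` is hypothetical; the repo's actual implementation may differ, but the output shape matches the example given.

```typescript
// Derive a display title from the last URL path segment, e.g.
// "introducing-gpt-5-2" -> "Introducing Gpt 5 2".
// Illustrative sketch; not the repo's actual code.
function titleFromSlug(url: string): string {
  const path = new URL(url).pathname;
  // Last non-empty path segment is the slug.
  const slug = path.split("/").filter(Boolean).pop() ?? "";
  // Split on hyphens and capitalize each word.
  return slug
    .split("-")
    .map((w) => (w ? w[0].toUpperCase() + w.slice(1) : w))
    .join(" ");
}

console.log(titleFromSlug("https://openai.com/index/introducing-gpt-5-2/"));
// → "Introducing Gpt 5 2"
```

This makes the failure mode clear: "Introducing Gpt 5 2" looks like a headline but carries no information beyond the slug itself, which is exactly why the prompt must tell the LLM not to expand on it.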
Fix
- `src/web.ts` — For `metadataOnly` sites, skip lastmod-based change detection. Only truly never-seen URLs are treated as new.
- `src/prompts-data.ts` — Two improvements: label metadata-only items so the LLM knows titles are URL-slug-derived with no content available, and add an explicit instruction in both EN/ZH prompts not to speculate on slug meanings or fabricate content summaries.

Result
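The `src/web.ts` change can be sketched as below, assuming a persisted map of previously seen URLs and their stored `lastmod` values. `SitemapEntry` and `detectNewUrls` are illustrative names, not the repo's actual identifiers.

```typescript
interface SitemapEntry {
  url: string;
  lastmod: string; // ISO timestamp from the sitemap
}

// Return entries that should be reported as new/changed.
// For metadataOnly sites, lastmod churn on already-tracked URLs is ignored:
// only never-seen URLs count. Otherwise a changed lastmod still counts.
// Illustrative sketch of the fix described above.
function detectNewUrls(
  entries: SitemapEntry[],
  seen: Map<string, string>, // url -> last stored lastmod
  metadataOnly: boolean,
): SitemapEntry[] {
  return entries.filter((e) => {
    if (!seen.has(e.url)) return true; // never seen: always new
    return metadataOnly ? false : seen.get(e.url) !== e.lastmod;
  });
}
```

With `metadataOnly: true`, a URL whose `lastmod` churns from run to run no longer resurfaces as "new", which is what eliminates the 50+ false positives per run.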
Test plan
🤖 Generated with Claude Code