Summary
When a remote schema endpoint provides an ETag header but no Last-Modified header, check-jsonschema currently treats the cache as always-fresh, leading to stale schemas being served indefinitely.
Current Behavior
The cache hit logic in cachedownloader.py relies solely on Last-Modified:

    def _cache_hit(cachefile: str, response: requests.Response) -> bool:
        if not os.path.exists(cachefile):
            return False
        local_mtime = os.path.getmtime(cachefile)
        remote_mtime = _lastmod_from_response(response)
        return local_mtime >= remote_mtime

When Last-Modified is missing, _lastmod_from_response() returns 0.0:

    def _lastmod_from_response(response: requests.Response) -> float:
        try:
            return calendar.timegm(
                time.strptime(response.headers["last-modified"], _LASTMOD_FMT)
            )
        except (OverflowError, ValueError, LookupError):
            return 0.0

Since any cached file's mtime is >= 0.0, the cache never invalidates.
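The failure mode can be reproduced without any network access. The following is a minimal sketch, not the project's code: it mirrors the two functions above but takes a plain dict of lowercase header names instead of a requests.Response, and the value of _LASTMOD_FMT is an assumption (the usual HTTP date format).

```python
import calendar
import time

# Assumed value: the conventional HTTP date format. The real constant in
# cachedownloader.py may differ.
_LASTMOD_FMT = "%a, %d %b %Y %H:%M:%S %Z"


def lastmod_from_headers(headers: dict) -> float:
    """Parse Last-Modified from a dict of lowercase header names, or return 0.0."""
    try:
        return calendar.timegm(time.strptime(headers["last-modified"], _LASTMOD_FMT))
    except (OverflowError, ValueError, LookupError):
        return 0.0


def cache_hit(local_mtime: float, headers: dict) -> bool:
    """Same comparison as _cache_hit, with the file mtime passed in directly."""
    return local_mtime >= lastmod_from_headers(headers)


# ETag-only response: the fallback of 0.0 makes even a very old file look fresh.
etag_only = {"etag": '"afd19c79c195c2e76f1d37bd12421d88"'}
print(cache_hit(1.0, etag_only))

# With Last-Modified present, the same old file is correctly treated as stale.
with_lastmod = {"last-modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
print(cache_hit(1.0, with_lastmod))
```

The first call reports a hit for a file with mtime 1.0 (one second after the epoch), which is the behavior described above; the second reports a miss.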
Real-World Impact
Mergify's schema endpoint (https://docs.mergify.com/mergify-configuration-schema.json) provides ETag but not Last-Modified:

    $ curl -sI https://docs.mergify.com/mergify-configuration-schema.json | grep -E '^(cache-control|etag|last-modified):'
    cache-control: public, max-age=600, no-transform
    etag: "afd19c79c195c2e76f1d37bd12421d88"

When Mergify adds new config options, users get false validation failures until they manually clear ~/.cache/check_jsonschema/.
I've filed Mergifyio/mergify#5161 requesting they add Last-Modified, but this is likely a common pattern for CDN-served content (Cloudflare in their case).
Suggested Enhancement
Support ETag as a fallback when Last-Modified is unavailable:
- Store the ETag value alongside cached files (e.g., in a .etag sidecar file or a metadata store)
- On subsequent requests, send an If-None-Match: <stored-etag> header
- If the server returns 304 Not Modified, treat it as a cache hit
- If the server returns 200 with a new ETag, update the cache and the stored ETag
This follows standard HTTP caching semantics where clients should support both Last-Modified/If-Modified-Since and ETag/If-None-Match.
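One possible shape for the ETag fallback is sketched below, assuming the sidecar-file variant. The helper names (etag_path, conditional_headers, apply_response) are hypothetical and do not exist in check-jsonschema; the HTTP call itself is left to the caller so the logic stays independent of requests, and response headers are assumed to arrive as a dict with lowercase names.

```python
import os


def etag_path(cachefile: str) -> str:
    """Hypothetical sidecar location: '<cachefile>.etag'."""
    return cachefile + ".etag"


def conditional_headers(cachefile: str) -> dict:
    """Return request headers, adding If-None-Match when a stored ETag exists."""
    if not os.path.exists(cachefile):
        return {}
    try:
        with open(etag_path(cachefile), encoding="utf-8") as fp:
            etag = fp.read().strip()
    except FileNotFoundError:
        return {}
    return {"If-None-Match": etag} if etag else {}


def apply_response(cachefile: str, status: int, headers: dict, body: bytes) -> bool:
    """Return True on a cache hit (304). On 200, rewrite the cache and sidecar."""
    if status == 304:
        return True
    if status != 200:
        raise ValueError(f"unexpected status code: {status}")
    with open(cachefile, "wb") as fp:
        fp.write(body)
    etag = headers.get("etag")  # assumes lowercase header names
    if etag:
        with open(etag_path(cachefile), "w", encoding="utf-8") as fp:
            fp.write(etag)
    elif os.path.exists(etag_path(cachefile)):
        # The server stopped sending ETag: drop the stale sidecar.
        os.remove(etag_path(cachefile))
    return False
```

A caller would pass conditional_headers(cachefile) as the request headers, then hand the status code, headers, and body to apply_response. Keeping the ETag in a sidecar file avoids changing the cache file format, at the cost of two files per cached schema.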
Alternatives Considered
- Require servers to set Last-Modified: Not always under user control, especially for third-party schemas
- Use Cache-Control: max-age: Already present in some responses (e.g., max-age=600), could be used as a TTL, though this would mean re-downloading more frequently than necessary when content hasn't changed
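For comparison, the max-age alternative could look roughly like the sketch below. Both function names are hypothetical, and the parsing is deliberately simplified: it handles only the max-age directive and ignores others such as no-cache or s-maxage.

```python
import time
from typing import Optional


def max_age_seconds(cache_control: Optional[str]) -> Optional[int]:
    """Extract max-age (in seconds) from a Cache-Control value, or None."""
    if not cache_control:
        return None
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isascii() and value.isdigit():
                return int(value)
            return None
    return None


def is_fresh(local_mtime: float, cache_control: Optional[str],
             now: Optional[float] = None) -> bool:
    """Treat the cached file as fresh while its age is below max-age."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return False
    current = time.time() if now is None else now
    return (current - local_mtime) < max_age


# Using the header from the Mergify response shown earlier:
print(max_age_seconds("public, max-age=600, no-transform"))  # 600
```

With max-age=600, a cached schema would be re-downloaded once it is ten minutes old whether or not it changed, which is the extra traffic this alternative trades for simplicity.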
References
- HTTP Conditional Requests: https://developer.mozilla.org/en-US/docs/Web/HTTP/Conditional_requests
- Related Mergify issue: "Schema endpoint missing Last-Modified header, causing stale cache issues with check-jsonschema" (Mergifyio/mergify#5161)