Support ETag for cache validation when Last-Modified is unavailable #668

@altendky

Summary

When a remote schema endpoint provides an ETag header but no Last-Modified header, check-jsonschema treats any existing cached copy as fresh, so a stale schema is served indefinitely.

Current Behavior

The cache hit logic in cachedownloader.py relies solely on Last-Modified:

def _cache_hit(cachefile: str, response: requests.Response) -> bool:
    if not os.path.exists(cachefile):
        return False
    local_mtime = os.path.getmtime(cachefile)
    remote_mtime = _lastmod_from_response(response)
    return local_mtime >= remote_mtime

When Last-Modified is missing, _lastmod_from_response() returns 0.0:

def _lastmod_from_response(response: requests.Response) -> float:
    try:
        return calendar.timegm(
            time.strptime(response.headers["last-modified"], _LASTMOD_FMT)
        )
    except (OverflowError, ValueError, LookupError):
        return 0.0

Since any cached file's mtime is >= 0.0, _cache_hit() returns True whenever the cache file exists, so the cache is never invalidated.
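
The effect can be reproduced without a network call. The following is a minimal sketch, not the project's code: it mirrors the two functions above on a plain dict of lower-cased headers, and the value of _LASTMOD_FMT is an assumption because the real constant is not shown in this issue.

```python
import calendar
import time

# Assumed format string; the real _LASTMOD_FMT constant is not shown in this issue.
_LASTMOD_FMT = "%a, %d %b %Y %H:%M:%S %Z"


def lastmod_from_headers(headers: dict) -> float:
    """Same fallback logic as _lastmod_from_response, applied to a plain dict.

    Unlike requests' case-insensitive headers, a dict needs the exact key,
    so this sketch expects lower-cased header names.
    """
    try:
        return calendar.timegm(
            time.strptime(headers["last-modified"], _LASTMOD_FMT)
        )
    except (OverflowError, ValueError, LookupError):
        return 0.0


def is_cache_hit(local_mtime: float, headers: dict) -> bool:
    """Same comparison as _cache_hit, with the file's mtime passed in directly."""
    return local_mtime >= lastmod_from_headers(headers)


# Headers shaped like the Mergify response: ETag present, Last-Modified absent.
etag_only = {"etag": '"afd19c79c195c2e76f1d37bd12421d88"'}
very_old_mtime = 1.0  # one second after the Unix epoch

print(lastmod_from_headers(etag_only))          # 0.0
print(is_cache_hit(very_old_mtime, etag_only))  # True
```

Even a cache file dated one second after the epoch counts as a hit, which is why only manually clearing the cache directory helps.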

Real-World Impact

Mergify's schema endpoint (https://docs.mergify.com/mergify-configuration-schema.json) provides ETag but not Last-Modified:

$ curl -sI https://docs.mergify.com/mergify-configuration-schema.json | grep -E '^(cache-control|etag|last-modified):'
cache-control: public, max-age=600, no-transform
etag: "afd19c79c195c2e76f1d37bd12421d88"

When Mergify adds new config options, users get false validation failures until they manually clear ~/.cache/check_jsonschema/.

I've filed Mergifyio/mergify#5161 requesting they add Last-Modified, but this is likely a common pattern for CDN-served content (Cloudflare in their case).

Suggested Enhancement

Support ETag as a fallback when Last-Modified is unavailable:

  1. Store the ETag value alongside each cached file (e.g., in a .etag sidecar file or a metadata store)
  2. On subsequent requests, send an If-None-Match: <stored-etag> header
  3. If the server returns 304 Not Modified, treat it as a cache hit
  4. If the server returns 200 with a new ETag, update both the cached file and the stored ETag

This follows standard HTTP caching semantics where clients should support both Last-Modified/If-Modified-Since and ETag/If-None-Match.
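
The four steps above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed patch: refresh_cache, the Fetch callable, and the ".etag" sidecar naming are all hypothetical, and the transport is injected so the logic can be exercised without a network.

```python
import os
from typing import Callable, Dict, Tuple

# Hypothetical transport signature:
# (url, request_headers) -> (status_code, lower-cased response_headers, body)
Fetch = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


def refresh_cache(url: str, cachefile: str, fetch: Fetch) -> bool:
    """Return True on a cache hit (304), False if the cache file was (re)written."""
    etag_path = cachefile + ".etag"  # step 1: sidecar file holding the ETag
    request_headers: Dict[str, str] = {}
    if os.path.exists(cachefile) and os.path.exists(etag_path):
        with open(etag_path, encoding="utf-8") as f:
            # step 2: ask the server to compare against the stored ETag
            request_headers["If-None-Match"] = f.read().strip()

    status, response_headers, body = fetch(url, request_headers)
    if status == 304:
        # step 3: the server confirms the cached copy is still current
        return True
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status} for {url}")

    # step 4: store the new body and the new ETag together
    with open(cachefile, "wb") as f:
        f.write(body)
    new_etag = response_headers.get("etag")
    if new_etag:
        with open(etag_path, "w", encoding="utf-8") as f:
            f.write(new_etag)
    elif os.path.exists(etag_path):
        # No ETag on the new response: drop the old one so it is never
        # paired with content it does not describe.
        os.remove(etag_path)
    return False
```

With requests, the fetch callable could wrap requests.get(url, headers=request_headers) and return the status code, the response headers with lower-cased names, and the body bytes. A metadata store instead of sidecar files would change only where the ETag is read and written.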

Alternatives Considered

  • Require servers to set Last-Modified: Not always under user control, especially for third-party schemas
  • Use Cache-Control: max-age: Already present in some responses (e.g., max-age=600) and could serve as a TTL. However, the schema would be re-downloaded every time the TTL expires, even when the content hasn't changed
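
For comparison, the max-age alternative might look like the sketch below. The helper names are hypothetical, and using the cache file's mtime as the download time is a simplifying assumption (it ignores the Age and Date headers that full HTTP freshness calculation takes into account).

```python
import re
import time
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Extract the max-age directive from a Cache-Control value, or None if absent."""
    match = re.search(
        r'(?:^|,)\s*max-age\s*=\s*"?(\d+)"?', cache_control, re.IGNORECASE
    )
    return int(match.group(1)) if match else None


def fresh_by_ttl(
    local_mtime: float, cache_control: str, now: Optional[float] = None
) -> bool:
    """Treat the cached file as fresh while its age is within max-age."""
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return False  # no TTL available, so the caller must revalidate
    now = time.time() if now is None else now
    return (now - local_mtime) <= max_age


print(max_age_seconds("public, max-age=600, no-transform"))  # 600
```

With max-age=600, a cached schema would be fetched again after ten minutes whether or not it changed, which is the extra download cost noted above.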
