Skip to content

Head request is not cached even with cacheable_methods=("GET", HEAD") without a previous GET #337

@Guts

Description

@Guts

Hello,

Thanks for this package, it sounds really useful and fit my needs.
I'm trying to use it to improve the Mkdocs RSS plugin when it comes to retrieve a remote image length as expected by the enclosure tag. For now, a HEAD request is tried to read the value from response content-length header. If it fails,

See related code: https://github.com/Guts/mkdocs-rss-plugin/blob/68c62e5b579b408dbc9999b251bb7c13c562cee8/mkdocs_rss_plugin/util.py#L620-L668

Here comes my quick & dirty dev script to test it quickly:

import http.client
import logging
from pathlib import Path

import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

http.client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
req_log = logging.getLogger("requests.packages.urllib3")
req_log.setLevel(logging.DEBUG)
req_log.propagate = True


sess = CacheControl(
    requests.Session(), cache=FileCache(".web_cache"), cacheable_methods=("HEAD", "GET")
)

# head requests
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)


# try again, cache hit expected
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

print(list(Path(".web_cache").iterdir()))

But reading the log, I can see that the cache is not used, nor even stored:

DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cdn.geotribu.fr:443
send: b'HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/1.1\r\nHost: cdn.geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 25 Jun 2024 10:37:50 GMT
header: Server: Apache/2.4.59 (Ubuntu)
header: Strict-Transport-Security: max-age=15552001; includeSubDomains;
header: Vary: User-Agent,Accept-Encoding
header: Upgrade: h2,h2c
header: Connection: Upgrade, Keep-Alive
header: Last-Modified: Mon, 15 May 2023 11:16:04 GMT
header: ETag: "1784e-5fbb994a06b57"
header: Accept-Ranges: bytes
header: Content-Length: 96334
header: Cache-Control: max-age=31536000, public
header: Keep-Alive: timeout=5, max=100
header: Content-Type: image/png
DEBUG:urllib3.connectionpool:https://cdn.geotribu.fr:443 "HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/11" 200 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
send: b'HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/1.1\r\nHost: cdn.geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 25 Jun 2024 10:37:50 GMT
header: Server: Apache/2.4.59 (Ubuntu)
header: Strict-Transport-Security: max-age=15552001; includeSubDomains;
header: Vary: User-Agent,Accept-Encoding
header: Last-Modified: Mon, 15 May 2023 11:16:04 GMT
header: ETag: "1784e-5fbb994a06b57"
header: Accept-Ranges: bytes
header: Content-Length: 96334
header: Cache-Control: max-age=31536000, public
header: Keep-Alive: timeout=5, max=99
header: Connection: Keep-Alive
header: Content-Type: image/png
DEBUG:urllib3.connectionpool:https://cdn.geotribu.fr:443 "HEAD /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/11" 200 0
Traceback (most recent call last):
  File "/home/jmo/Git/Perso/mkdocs-rss-plugin/tests/dev/dev_cached_http.py", line 45, in <module>
    print(list(Path(".web_cache").iterdir()))
  File "/usr/lib/python3.10/pathlib.py", line 1017, in iterdir
    for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '.web_cache'

BUT if I make a GET request to the same resource before the HEAD, the cache is stored AND even read for the HEAD:

import http.client
import logging
from pathlib import Path

import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

http.client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
req_log = logging.getLogger("requests.packages.urllib3")
req_log.setLevel(logging.DEBUG)
req_log.propagate = True


sess = CacheControl(
    requests.Session(), cache=FileCache(".web_cache"), cacheable_methods=("HEAD", "GET")
)


# get requests
resp = sess.get("https://geotribu.fr")
resp_img = sess.get(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

# try again, cache hit expected
resp = sess.get("https://geotribu.fr")
resp_img = sess.get(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

# head requests
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)


# try again, cache hit expected
resp_img = sess.head(
    "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
)

print(list(Path(".web_cache").iterdir()))

Logs:

DEBUG:cachecontrol.controller:Looking up "https://geotribu.fr/" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): geotribu.fr:443
send: b'GET / HTTP/1.1\r\nHost: geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Connection: keep-alive
header: Content-Length: 36931
header: Server: GitHub.com
header: Content-Type: text/html; charset=utf-8
header: Last-Modified: Mon, 24 Jun 2024 17:15:31 GMT
header: Access-Control-Allow-Origin: *
header: ETag: W/"6679a9b3-1c64a"
header: expires: Tue, 25 Jun 2024 05:36:20 GMT
header: Cache-Control: max-age=600
header: Content-Encoding: gzip
header: x-proxy-cache: MISS
header: X-GitHub-Request-Id: 9410:182824:28A2A85:2986177:667A54FB
header: Accept-Ranges: bytes
header: Age: 311
header: Date: Tue, 25 Jun 2024 10:42:09 GMT
header: Via: 1.1 varnish
header: X-Served-By: cache-par-lfpg1960074-PAR
header: X-Cache: HIT
header: X-Cache-Hits: 0
header: X-Timer: S1719312129.348531,VS0,VE1
header: Vary: Accept-Encoding
header: X-Fastly-Request-ID: 4197c8a81421f5a2aed12f7abecd529eb36a01bb
DEBUG:urllib3.connectionpool:https://geotribu.fr:443 "GET / HTTP/11" 200 36931
DEBUG:cachecontrol.controller:Updating cache with response from "https://geotribu.fr/"
DEBUG:cachecontrol.controller:etag object cached for 1209600 seconds
DEBUG:cachecontrol.controller:Caching due to etag
DEBUG:filelock:Attempting to acquire lock 135385395530768 on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:filelock:Lock 135385395530768 acquired on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:filelock:Attempting to release lock 135385395530768 on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:filelock:Lock 135385395530768 released on .web_cache/a/e/3/4/8/ae3480509cd61ad7727667c13727f5909a71b094423dffa3730b3ab8.lock
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): cdn.geotribu.fr:443
send: b'GET /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/1.1\r\nHost: cdn.geotribu.fr\r\nUser-Agent: python-requests/2.32.3\r\nAccept-Encoding: gzip, deflate\r\nAccept: */*\r\nConnection: keep-alive\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 25 Jun 2024 10:42:09 GMT
header: Server: Apache/2.4.59 (Ubuntu)
header: Strict-Transport-Security: max-age=15552001; includeSubDomains;
header: Vary: User-Agent,Accept-Encoding
header: Upgrade: h2,h2c
header: Connection: Upgrade, Keep-Alive
header: Last-Modified: Mon, 15 May 2023 11:16:04 GMT
header: ETag: "1784e-5fbb994a06b57"
header: Accept-Ranges: bytes
header: Content-Length: 96334
header: Cache-Control: max-age=31536000, public
header: Keep-Alive: timeout=5, max=100
header: Content-Type: image/png
DEBUG:urllib3.connectionpool:https://cdn.geotribu.fr:443 "GET /img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png HTTP/11" 200 96334
DEBUG:cachecontrol.controller:Updating cache with response from "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png"
DEBUG:cachecontrol.controller:etag object cached for 1209600 seconds
DEBUG:cachecontrol.controller:Caching due to etag
DEBUG:filelock:Attempting to acquire lock 135385395533696 on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:filelock:Lock 135385395533696 acquired on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:filelock:Attempting to release lock 135385395533696 on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:filelock:Lock 135385395533696 released on .web_cache/8/d/2/2/c/8d22c12b7f7b5166ec3853b07d4b8dc0cb5c13a8eba0e19e34e4fb70.lock
DEBUG:cachecontrol.controller:Looking up "https://geotribu.fr/" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 600
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:600 > 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 31536000
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:31536000 > 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 31536000
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:31536000 > 0
DEBUG:cachecontrol.controller:Looking up "https://cdn.geotribu.fr/img/articles-blog-rdp/capture-ecran/kevish_Air-Traffic.png" in the cache
DEBUG:cachecontrol.controller:Current age based on date: 0
DEBUG:cachecontrol.controller:Freshness lifetime from max-age: 31536000
DEBUG:cachecontrol.controller:The response is "fresh", returning cached response
DEBUG:cachecontrol.controller:31536000 > 0
[PosixPath('.web_cache/8'), PosixPath('.web_cache/a')]

Possibly related issue: #216

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions