@TomAugspurger
Collaborator

Closes #215

WIP for now, I need to incorporate #215 (comment).

@TomAugspurger
Collaborator Author

@martindurant Is there a standard schema for what entries in DirCache should look like? I see that github and ftp use them. In fsspec/spec.py, it's only used in _ls_from_cache, which seems to expect something like

{
    "path": {
        "name": name,
    }
} 

Standardizing the structure here would be good (maybe in a dataclass?), but I'm not sure what all to include yet.

@TomAugspurger
Collaborator Author

For ftp, elements of DirCache are List[Tuple[str, Dict]], where the first item of the tuple is the path of the element, and the dict has the keys modify, name, perm, size, type, unique.

(Pdb) pp out[:2]
[('__init__.py',
  {'modify': '20190813183127',
   'name': '/__init__.py',
   'perm': 'r',
   'size': 0,
   'type': 'file',
   'unique': '1000004g2058413d7'}),
 ('__pycache__',
  {'modify': '20191127162327',
   'name': '/__pycache__',
   'perm': 'el',
   'size': 0,
   'type': 'dir',
   'unique': '1000004g206bad42b'})]

Those are all the entries (files and directories) under the path (the key).

For github we just have a List[Dict], and the keys in the dict are name, mode, type, size, sha.

These are inconsistent. At the moment, I'm leaning toward a namedtuple structure like

CacheItem = namedtuple("CacheItem", ["name", "details"])

where name is a string, and details is a dict with anything. Hopefully that will suffice.
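
A minimal sketch of how both shapes could be wrapped in that namedtuple. The helper names `items_from_ftp` and `items_from_github` are hypothetical and the sample values are made up for illustration; nothing here is existing fsspec API.

```python
from collections import namedtuple

CacheItem = namedtuple("CacheItem", ["name", "details"])


def items_from_ftp(entries):
    """Wrap ftp-style (path, details) tuples. Hypothetical helper."""
    return [CacheItem(name=path, details=details) for path, details in entries]


def items_from_github(entries):
    """Wrap github-style dicts, taking the name from the dict. Hypothetical helper."""
    return [CacheItem(name=d["name"], details=d) for d in entries]


# Made-up sample entries in the two shapes described above.
ftp_items = items_from_ftp(
    [("__init__.py", {"name": "/__init__.py", "size": 0, "type": "file"})]
)
github_items = items_from_github(
    [{"name": "README.md", "mode": "100644", "type": "file", "size": 12, "sha": "abc"}]
)
```

Either way, callers would read `item.name` and treat `item.details` as backend-specific.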

@martindurant
Member

The canonical structure should be:

{"cached_path": [
    {"name": "file_path",
     "size": 10,
     "type": "file"},
    ...
   ]
}

The FTP case is clearly based on the output of the client library, and ought to be processed into canonical form, as is done for s3, gcs...
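
A rough sketch of what that processing could look like for the FTP listing shown earlier in this thread. The function name `ftp_to_canonical` is hypothetical, and mapping the FTP type "dir" to "directory" is an assumption; the thread only shows "file" as a canonical type value.

```python
def ftp_to_canonical(path, entries):
    """Convert ftp-style (name, details) tuples into {path: [entry, ...]}.

    Hypothetical helper, not existing fsspec code.
    """
    listing = []
    for _, details in entries:
        listing.append(
            {
                "name": details["name"],
                "size": int(details["size"]),
                # Assumption: directories are spelled "directory" in canonical form.
                "type": "directory" if details["type"] == "dir" else "file",
            }
        )
    return {path: listing}


# The two entries from the Pdb output above, trimmed to the keys that matter here.
raw = [
    ("__init__.py", {"name": "/__init__.py", "size": 0, "type": "file"}),
    ("__pycache__", {"name": "/__pycache__", "size": 0, "type": "dir"}),
]
cache = ftp_to_canonical("/", raw)
```

Client-specific keys such as modify, perm and unique are dropped here; they could just as well be carried along as extra keys.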

@martindurant
Member

martindurant commented Dec 5, 2019

i.e., the key is the path that we did a listing for
(but I'm fine with the inner structure being dict-like too, so we can find an entry quickly; however, it may be possible on, e.g., s3, that two identical names exist, one as a prefix and one as a file)
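
One possible way to get a dict-like inner structure that survives that collision is to key on both name and type. This is only a sketch: the `(name, type)` key and the helper name `index_listing` are assumptions, not something settled in this thread.

```python
def index_listing(listing):
    """Index a canonical listing by (name, type) for quick lookup.

    Hypothetical helper; keying on the pair keeps a prefix and a file
    with the same name as separate entries.
    """
    return {(entry["name"], entry["type"]): entry for entry in listing}


# Made-up listing where a file and a prefix share one name.
listing = [
    {"name": "bucket/data", "size": 10, "type": "file"},
    {"name": "bucket/data", "size": 0, "type": "directory"},
]
index = index_listing(listing)
```

A lookup then needs the type as well, e.g. `index[("bucket/data", "file")]`.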

@martindurant
Member

This still looks useful to me

@martindurant
Member

Superseded by #243

@TomAugspurger TomAugspurger deleted the dircache branch December 22, 2020 17:12

Development

Successfully merging this pull request may close these issues.

Standardize dircache timeout

2 participants