Cached filesystem not concurrency-safe? #1107
You may well be right. The caching metadata should be concurrency-safe (i.e., for async use), but not thread-safe. The writing of the file could be protected by a lock, or could use filesystem …
Thank you @martindurant for the quick response. Do you think you would implement one of those mechanisms in some near future to make fsspec thread safe in that regard?
It is a reasonable request, but I don't think anyone has such plans right now.
ok, if you were to do it -- would a threading-lock solution be considered? (I think making …
The cache metadata should be conservative, so that races lead to occasional unnecessary re-fetches of data that were cached but whose info was not persisted. That's OK, if it is relatively rare. The problem with the lock is that it will still be vulnerable to multiple processes using the same cache directory.
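The in-process lock idea can be sketched as follows. This is a hypothetical `CacheMetadata` class, not fsspec's actual API: a `threading.Lock` serialises metadata writers within one process, although, as noted in the comment above, it would not protect multiple processes sharing the same cache directory.

```python
import pickle
import threading


class CacheMetadata:
    """Hypothetical sketch: guard cache-metadata persistence with a lock.

    A threading.Lock serialises writers within one process only; two
    separate processes sharing one cache directory can still race.
    """

    def __init__(self, path):
        self.path = path
        self.cached_files = {}
        self._lock = threading.Lock()

    def save(self):
        # Only one thread at a time may rewrite the metadata file.
        with self._lock:
            with open(self.path, "wb") as f:
                pickle.dump(self.cached_files, f)
```

The lock cost is negligible here because saves are rare compared to reads; the real limitation is the multi-process case Martin describes.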
ok, @jwodder - could you please implement a …
The relevant code is at filesystem_spec/fsspec/implementations/cached.py, lines 187 to 191 (commit acad158).
The only other method that writes to the cache is …
I wonder if that …
Yes, I agree with that pattern - it's exactly what I've had to do elsewhere recently. This is what happens when you develop on a machine with exactly one disk :)
so seems like it begs for some reusable helper like …
which would take care of all the renames etc. at the target …
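Such a helper might look like the sketch below (`atomic_write` is a hypothetical name, not an existing fsspec function). The key point, matching the "one disk" remark above, is that the temporary file must be created in the *target* directory, so the final rename is a same-filesystem operation and therefore atomic; a temp file in `/tmp` could live on a different device, where rename fails.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Hypothetical helper: replace the file at `path` atomically.

    Readers always see either the old contents or the new contents,
    never a half-written file.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace() is a
    # same-filesystem rename (atomic on POSIX and on Windows).
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before rename
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # don't leave the temp file behind on failure
        raise
```

Note this makes each individual save atomic; it does not by itself prevent the lost-update races discussed later in this thread, where a stale snapshot overwrites a newer file.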
Now the FUSEd cached filesystem is failing with:
…
huh, that file exists, but is big. Let's remove that test, I don't remember why it's around. Perhaps we can think of some other very static URL that's likely to be around for a long time.
@martindurant Did you mean to comment on a different issue? My comment has nothing to do with fsspec's tests.
Oh, well if it's not one of our tests...
@martindurant I was referring to the in-production behavior of the program I described in my original comment at the start of this issue.
@yarikoptic I'm not sure that the current error is a datalad-fuse error. I suspect it may be a side effect of multiple FUSE operations writing to the fsspec cache file at once, resulting in some entries being randomly lost as cache file A is replaced by cache file B, where file B was prepared from data generated before file A was written.
ah, ok, so those are the situations @martindurant mentioned above. Maybe we are doomed to not rely on their (our internal and fsspec cache) 1-to-1 correspondence... will check in more detail later unless someone beats me to it (about to take off on a plane without internet)
@jwodder could you please try with that …
Overall I feel that there would be no solution without introducing thread-safety into the manipulation of the cache here. Atomic writes (#1111) are good to have but not a full solution indeed, and would require analysis of the code paths regarding modifications of …
@yarikoptic Do you mean just suppress the error in datalad-fuse's code?
Yes
The program is running without errors now, though datalad-fuse is emitting a number of warnings like:
which may or may not suggest that #1078 wasn't fully resolved, even though the problem no longer arises on the MVCE I posted there. |
The program eventually crashed with the following error from the FUSE process:
The second error was repeated several times. I'm unsure exactly what happened internally.
… would eliminate the first issue. The second implies that …
@martindurant Adding … As for the second error, it's clear that …
casting into list is right there a few lines down: see e.g. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/cached.py#L186

```python
cache = {k: v.copy() for k, v in cached_files.items()}
for c in cache.values():
    if isinstance(c["blocks"], set):
        c["blocks"] = list(c["blocks"])
```

which was added in a6d96f7. Unfortunately it is not clear to me the reason for the casting -- maybe @martindurant you remember and see how we should get out of this "pickle"? ;)
It was a long time ago... Since it's in the save function, maybe it was to allow JSON serialisation.
as there is no more JSON there but rather pickle -- I guess it should be safe to remove that, right?
I also can't grasp what it is about, but I guess since we run into it only in our case, it has something to do with multithreading -- maybe some other code path manages to modify it somehow? (although I don't see how). As for identity -- I think since …
@yarikoptic The casting to list is performed on a copy of the cache that is then pickled, and the lists are converted back to sets when un-pickled; as far as I can tell, the cache itself should always contain sets.
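A quick check supports the JSON-serialisation hypothesis above: pickle round-trips sets directly, so the list cast isn't needed for a pickle-based format, whereas `json` (which the cache file may have used when the cast was added) rejects sets outright.

```python
import json
import pickle

blocks = {0, 1, 5}  # illustrative "blocks" entry, a set of block numbers

# pickle handles sets natively -- the cast to list is unnecessary for it
assert pickle.loads(pickle.dumps(blocks)) == blocks

# json cannot serialise a set at all...
try:
    json.dumps(blocks)
    raised = False
except TypeError:  # "Object of type set is not JSON serializable"
    raised = True
assert raised

# ...but works fine once the set has been cast to a list
assert json.dumps(sorted(blocks)) == "[0, 1, 5]"
```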
I'm sorry, my family is ill, not sure when I can get back to you on this. Please try pinging me in a couple of days if you've made no progress.
no problem @martindurant -- get better!
I do not see any casting to …
@yarikoptic If the in-memory cache (filesystem_spec/fsspec/implementations/cached.py, lines 171 to 173 in d483cca) …
Additionally, there's this in filesystem_spec/fsspec/implementations/cached.py, lines 143 to 145 in d483cca: …
ok, anyways -- it seems that there is no longer a need for …
eh, I should have spotted it -- AFAIK #1111 does not completely address this issue (concurrency-safety), so it should not have been closed with that PR.
We have a program that uses fsspec to mount a cached HTTP filesystem as a FUSE mount; while this process runs in the background, another process traverses the FUSE mount and inspects multiple files in parallel. Unfortunately, the FUSE process keeps hitting errors of the following form:
I suspect the cause is that the cached filesystem is not safe for concurrent access, with the result that multiple actions on the filesystem are causing the cache file to be read by one procedure while another writes to it. Indeed, there seems to be no locking in the cached filesystem code.
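The suspected lost-update race can be reproduced deterministically, without threads, by interleaving two read-modify-write cycles against a shared pickle file. The paths and entry names below are illustrative, not fsspec's actual cache layout: each "writer" loads the whole cache, updates its own entry, and writes the whole file back, so whichever writer saves last silently discards the other's entry.

```python
import os
import pickle
import tempfile

# A shared cache file, as both writers would see it.
path = os.path.join(tempfile.mkdtemp(), "cache")
with open(path, "wb") as f:
    pickle.dump({}, f)

def load():
    with open(path, "rb") as f:
        return pickle.load(f)

def save(cache):
    with open(path, "wb") as f:
        pickle.dump(cache, f)

# Both writers snapshot the cache before either has saved.
snapshot_a = load()
snapshot_b = load()

snapshot_a["file_a"] = {"blocks": {0}}
save(snapshot_a)  # cache file now contains file_a

snapshot_b["file_b"] = {"blocks": {0}}
save(snapshot_b)  # B rewrites from its stale snapshot: file_a is gone

assert "file_a" not in load()  # A's entry was silently lost
```

Note that making each `save` atomic (as in #1111) does not prevent this: both saves are individually atomic, but the second one is still computed from stale data.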