Occasional errors on AppVeyor due to writing .mypy_cache #3215
This probably has the same root cause as this; TL;DR: the Windows file system inherently leads to intermittent race conditions on write-access attempts, even without any other processes, antivirus, etc. running. If this is a risk in production, retrying the failing operation seems to be the usual remedy. For AppVeyor, the simplest solution seems to be to make that change in the test framework.
Fun. I followed some older issues to https://mail.python.org/pipermail/python-dev/2008-April/078333.html and my recollection is that it plagued us even in the 2000-2003 era (when I shared an office at PythonLabs with Tim Peters). But "Windows file system inherently leads to intermittent race conditions on write access attempts, even without any other processes, antivirus, etc." sounds awfully vague. Is there a Microsoft KB issue about this? Is it really the FS or is it Microsoft's libc emulation? Anywho, I don't like making an exception in mypy itself just for AppVeyor -- if it happens to AppVeyor it may happen to others (also it may interfere with mypy runs by others on AppVeyor). Maybe the whole always-write-cache thing needs to be disabled by the test framework. Or maybe we need a separate "full-run-but-write-cache" flag instead of the default behavior, then Dropbox can enable that (we never run mypy on Windows) and everyone else is freed of this.
I think that we should understand the root cause before making a decision. If the correct fix is to add a retry loop on Windows, then we should probably just do the right thing here, always, even if it sounds like an ugly hack. We could have a utility function that does the retry thing (e.g. a thin wrapper around the relevant os call).
I meant without any user processes, sorry. Yes, of course, ultimately it's another process holding the handle to the Windows file that prevents it from being deleted; I was just saying that Windows system-level processes (file indexer, analyzer of file type, etc.) could be to blame, and there's almost no way to reliably turn all of them off. I found a couple more references. Apart from the wait/retry solution that I mentioned, I found nothing else recommended by people who have spent time on this issue. Should we try to implement it and see if it solves the problem without too much extra delay?
Yes, I think that we should at least see if a wait/retry loop seems to fix the issue. The only major drawback to such a workaround that I can see would be slowdown when things are failing consistently, but that should be rare. Discussion at the first referenced article suggests that we may have to wait 2 seconds or more. It's likely that anything much longer than that could be disruptive, though.
OK, now I understand. It is another process, just nothing we can control. I agree we should try a retry loop. This should be implemented as a set of utility functions wrapping standard os functions that do the retry loop only on Windows. That way they can be called from various places and we don't have to uglify the call sites (I suspect there are a few different call sites and os functions that may have this problem). Nevertheless I would also like to turn off cache writing when the test suite runs. Finally I would like to make the cache-writing code more robust, so that if at any point the cache write fails (maybe because I replaced .mypy_cache or one of its subdirectories with a plain file or with a symlink to a non-writable directory) the mypy run doesn't crash like it does now. (In fact this may be a substitute for the retry loop? If the cache has problems we silently don't write it. The next run will regenerate the cache. That's how Python writes .pyc files.)
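A minimal sketch of that last idea (skip the cache when a write fails, the way CPython treats .pyc files), assuming a hypothetical write_cache_file helper rather than mypy's actual cache code:

```python
import os

def write_cache_file(path: str, data: str) -> bool:
    """Best-effort cache write: never let a cache failure crash the run.

    If the cache cannot be written, the next run simply regenerates it.
    """
    try:
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        with open(path, 'w') as f:
            f.write(data)
        return True
    except OSError:
        # e.g. .mypy_cache replaced by a plain file, or not writable
        return False
```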
Windows race conditions only affect file deletion / renaming, not file writing. Separately, I noticed that …
My understanding so far:
The best solution I can think of for atomic write is based on this:
This has no race conditions. Should I implement it?
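Roughly, the write side would follow the write-to-a-temp-file-then-rename pattern; the sketch below is illustrative (a hypothetical atomic_write helper), not the exact numbered steps proposed:

```python
import os
import tempfile

def atomic_write(path: str, data: str) -> None:
    # Create the temp file in the same directory as the target, so the
    # final rename never crosses filesystems.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.',
                               prefix=os.path.basename(path) + '.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(data)
        # Atomic on POSIX; on Windows this can raise PermissionError if
        # another process has the destination open -- the problem at hand.
        os.replace(tmp, path)
    except BaseException:
        try:
            os.remove(tmp)  # don't leave temp files behind on failure
        except OSError:
            pass
        raise
```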
I think that looks reasonable. Steps 1-2 are exactly as we do them today. Steps 3-5 are the Windows version of os.replace(). Are you proposing to do this just on Windows or also on Unix? At least on Unix I'd worry that tmpdir is on a different fs than original_filename -- that's why the current code doesn't use the tempfile module but instead uses its own random string that's appended to the filenames. I still think it makes sense to have a replace() function in mypy/util.py that just calls os.replace() on Unix but does the other dance on Windows (and it can arrange for the lazy deletions using atexit).
Yup, just on Windows. On Unix, os.replace() is already atomic, so nothing special is needed there.
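A sketch of the replace() helper described above: os.replace() on Unix, and on Windows the rename-aside dance with lazy deletion registered via atexit. Names are illustrative, and as the next comments show, the rename-aside step can itself fail while the file is held open:

```python
import atexit
import os
import random
import sys

_pending_deletes = []  # old cache files that were still locked (Windows only)

def _delete_pending() -> None:
    for old in _pending_deletes:
        try:
            os.remove(old)
        except OSError:
            pass  # still locked; a later run can clean it up

atexit.register(_delete_pending)

def replace(src: str, dest: str) -> None:
    if sys.platform != 'win32':
        os.replace(src, dest)  # atomic on POSIX
        return
    old = '%s.old.%08x' % (dest, random.getrandbits(32))
    try:
        os.rename(dest, old)   # move the existing file out of the way
    except FileNotFoundError:
        old = None             # no existing file to replace
    os.rename(src, dest)       # may still fail if dest is recreated or locked
    if old is not None:
        try:
            os.remove(old)
        except OSError:
            _pending_deletes.append(old)  # delete lazily at exit
```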
It turns out that in order to rename a Windows file that's open in another process, one has to use a 3-argument version of the rename API. We don't have a function in the Python standard library that exposes that 3-argument version. Note that this is completely unrelated to the retry-loop question. I can implement a retry loop, unless someone who's done that before already has ready-to-use code.
Uhh... can't you use ctypes?
Ah, I didn't know you could call the Windows API from pure Python 😲 Yes, I managed to call it, so I'm testing out whether I can achieve the desired behavior.
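For reference, calling a Win32 rename function from pure Python via ctypes could look roughly like this; the assumption that the call in question is MoveFileExW with MOVEFILE_REPLACE_EXISTING is mine, and as the thread concludes below, it does not actually get around the lock:

```python
import ctypes
import sys

MOVEFILE_REPLACE_EXISTING = 0x1  # dwFlags bit: overwrite an existing destination

def win_replace(src: str, dst: str) -> None:
    if sys.platform != 'win32':
        raise NotImplementedError('MoveFileExW is Windows-only')
    # ctypes passes Python str arguments to ...W functions as wide strings.
    ok = ctypes.windll.kernel32.MoveFileExW(src, dst, MOVEFILE_REPLACE_EXISTING)
    if not ok:
        raise ctypes.WinError()  # wraps GetLastError() in an OSError
```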
I recently learned about https://github.com/untitaker/python-atomicwrites (because it has a stub in typeshed :) ). Is their approach useful?
@JelleZijlstra I think not. Their support for Windows is based on a function that assumes a certain kind of access to the target file.
But why couldn't we assume the same access?
Well, supposedly there was a function that allows renaming even if delete access isn't allowed. I thought maybe they had figured out how to use it properly. But actually they are doing precisely what we are already doing - and since they don't do the wait/retry loop, the behavior of their approach should be identical to ours (in that it will intermittently fail). Sadly, my tests show that the holy grail of Windows rename is just a myth. So I'll implement the wait/retry loop after all.
Thanks for the thorough research!
For the wait/retry loop, I'll borrow the solution from the CPython test support module; see the discussion about it here. Unfortunately, I can't just reuse it as-is. Another consideration is whether we actually need to block until the atomic write has completed. At present, the answer is that we don't need to block right away, since we don't need to read from those files for a while (we can wait at least until the end of the build). The cleanest solution would be to refactor the cache-writing code so that all the waiting happens in one place at the end. An alternative is to wait for each file, but if the total wait time exceeds a threshold, exit and warn the user that something is really wrong with access rights. Finally, I can simply wait for each file, and not worry about the case that multiple files are permanently stuck. Does anyone have an opinion?
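The "don't block right away" option could amount to queuing failed replaces and retrying them once near the end of the run; the helper names below are hypothetical:

```python
import os
import time

_pending = []  # (tmp, dest) pairs whose os.replace() hit a PermissionError

def replace_or_defer(tmp: str, dest: str) -> None:
    try:
        os.replace(tmp, dest)
    except PermissionError:
        _pending.append((tmp, dest))  # don't block the build now; retry later

def flush_pending(deadline: float = 2.0) -> None:
    """Retry deferred replaces at the end of the run, within a time budget."""
    end = time.monotonic() + deadline
    while _pending and time.monotonic() < end:
        tmp, dest = _pending.pop(0)
        try:
            os.replace(tmp, dest)
        except PermissionError:
            _pending.append((tmp, dest))
            time.sleep(0.05)  # brief pause before the next attempt
    # Anything still pending is simply dropped; the next run regenerates it.
```

The trade-off is that a genuinely stuck file only surfaces at the very end, whereas per-file waiting reports it immediately.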
If something fails to get deleted, we can just fail the entire build and not try anything else? If the write succeeds but the replace fails, the user has a configuration problem -- we have partial write access to the cache directory. We can detect this and just fail immediately instead of trying to write additional cache files. There will be at most a one-second delay, which is reasonable. Here are the most likely scenarios if I've understood this correctly:
I'm not 100% sure I understood you, so I'll summarize what we know and then respond to your bullet points. The original problem is the intermittent failure when writing .mypy_cache files on AppVeyor.
They work perfectly on Linux. The problem on Windows is caused by the combination of two things: a file that is open in another process cannot be replaced or deleted, and Windows system processes (indexer, antivirus, etc.) may briefly hold any file open.
It's a well-known problem. Among others, CPython test suite failures were caused by the same issue, and they were addressed by a solution (undocumented, but visible in the CPython test support code).
No, the second write may fail too. The lock is on the destination file.
Yes.
Unfortunately, this is likely not a user configuration problem, but just that the lock on the destination file happened to be held at the wrong moment.
OK, so the solution really is just to retry with exponential backoff and a 1 sec timeout, but the failure is unlikely to happen, so we won't have to worry about all writes taking an extra 1 sec per file. Then all I recommend is simplifying the wait function to the bare minimum, per my previous comment.
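The bare-minimum version suggested here would be little more than an exponential-backoff loop around os.replace(), capped at about one second; this is a sketch, not the code that was eventually merged:

```python
import os
import time

def replace_with_retry(src: str, dest: str, timeout: float = 1.0) -> None:
    delay = 0.001
    waited = 0.0
    while True:
        try:
            os.replace(src, dest)
            return
        except PermissionError:
            if waited >= timeout:
                raise  # still locked after ~1 second: give up
            time.sleep(delay)
            waited += delay
            delay *= 2  # exponential backoff
```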
I think this can be closed. No more crashes on AppVeyor as far as I can tell.
Since #3133 was merged I see occasional test failures on AppVeyor (i.e. on Windows) with the traceback below. I suspect this is due to some background process on Windows (e.g. a virus checker or another container?) still having the file open, but a simple (?) solution would be to add --cache-dir=nul to the mypy invocations on Windows. (Though that will need to be done somewhere in the test framework, I suspect.) For now I'm just going to ignore such failures.
(UPDATE: example taken from https://ci.appveyor.com/project/gvanrossum/mypy/build/1.0.1227/job/9q6qgru4f6ig6ar8 )
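In a test harness, that workaround might amount to appending the flag when the mypy command line is built; the run_mypy helper below is hypothetical, and os.devnull resolves to 'nul' on Windows:

```python
import os
import subprocess
import sys

def run_mypy(args):
    cmd = [sys.executable, '-m', 'mypy'] + list(args)
    if sys.platform == 'win32':
        # Point the cache at the null device so nothing is written to disk.
        cmd.append('--cache-dir=' + os.devnull)
    return subprocess.run(cmd).returncode
```

This keeps cache writes out of CI entirely without special-casing anything inside mypy itself.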