
Reliable version of os.replace for Windows (wait/retry loop) #3239


Closed
wants to merge 14 commits

Conversation

@pkch (Contributor) commented Apr 25, 2017

Fix #3215

Copying my comment from there:

Do we need to block until the atomic write completes? At present, the answer is that we don't need to block right away, since we don't need to read from those files for a while (we can wait at least until the end of process_stale_scc). So we can postpone all the failed writes until some checkpoint in the future, greatly reducing the worst-case delay, since we can wait for all the "stuck" files at once. However, this means that the wait/retry would run from a faraway location in the code. That makes this approach dangerous: who knows where else write_cache() might be used in the future? In fact, there's already a TODO item to add it to update.build_incremental_step(). (Note that write_cache saves a single module, not the entire cache.)

The cleanest solution would be (1) to refactor write_cache to take a list of files to be written and wait for completion at the end of that function. This would give the best of both worlds: no waiting on each individual file, and no chance for future contributors to accidentally write subtly buggy code that relies on writes that haven't happened yet.

An alternative is (2) to wait for each file, but if total wait time exceeds a threshold, exit and warn user that something is really wrong with access rights.

Finally, I can simply (3) wait for each file, and not worry about the case when multiple files are permanently stuck.

For now I went with (3), but made the wait_for API flexible enough to support refactoring into (1) if that's desired.
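
A minimal sketch of such a wait_for, assuming the signature the tests below use (a sequence of callables, a tuple of exception types, and a total timeout); the real implementation differs in details:

    import time
    from typing import Callable, Sequence, Tuple, Type

    def wait_for(funcs: Sequence[Callable[[], None]],
                 exceptions: Tuple[Type[BaseException], ...],
                 timeout: float) -> None:
        """Run each function, retrying on the given exceptions with
        exponential backoff; reraise the last exception once roughly
        timeout seconds have been spent waiting on a single function."""
        for func in funcs:
            deadline = time.perf_counter() + timeout
            wait = 0.001
            while True:
                try:
                    func()
                    break
                except exceptions:
                    if time.perf_counter() >= deadline:
                        raise
                    time.sleep(wait)
                    wait *= 2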

@gvanrossum (Member)

Honestly the wait_for() function seems pretty over-engineered. Its only known call site (outside tests) has only one function and one exception. I think we can make it simpler.

I'm still pondering Jukka's comment in the issue, but TBH I'm not sure I follow the scenarios. How likely is it that we end up waiting a second for each cache file we write? That would be pretty horrible, but does it really happen? After all, the issue only reported occasional AppVeyor failures.

I have some other nits but I think we should decide on the high-level approach first. Should we really change the behavior of write_cache() (which was pretty carefully designed when it comes to avoiding corrupt files no matter at what point the process is killed hard, assuming reasonable filesystem semantics) or is just retrying the replace() until it succeeds enough?
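
For context, the crash-safety design referred to here is the usual write-to-a-temp-file-then-rename pattern; a schematic sketch, not the actual write_cache code:

    import os
    import tempfile

    def atomic_write(path: str, data: str) -> None:
        # Write to a temp file in the same directory, then rename it over
        # the target; os.replace is atomic, so a hard kill at any point
        # leaves either the old complete file or the new one, never a mix.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
        try:
            with os.fdopen(fd, 'w') as f:
                f.write(data)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise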

@pkch (Contributor, Author) commented Apr 25, 2017

Honestly the wait_for() function seems pretty over-engineered. Its only known call site (outside tests) has only one function and one exception. I think we can make it simpler.

Sure, but I thought we might also want to use it for os.remove. There are a few of those in the codebase and we haven't discussed them yet. I suspect permission errors on them either cause the same issue as os.replace, or are ignored, which I'm not sure is safe.

How likely is it that we're in a scenario where we end up waiting a second for each cache file we write?

User misconfiguration. For example, cache folder has write permission so new temp files can be created, but the existing cache files are owned by another user so they can never be replaced.

After all the issue only noticed occasional AppVeyor failures.

Yes, but you said we wanted to solve it not just for AppVeyor (which has a near-perfect setup) but for Windows users in general, who are in many different situations (bad configuration, many more processes running, etc.).

@gvanrossum (Member)

Let's have a simple wait function now -- we can refactor it when we need it for the remove calls (those haven't been failing AFAIK).

For example, cache folder has write permission so new temp files can be created, but the existing cache files are owned by another user so they can never be replaced.

Then the first replace() call will time out, which will raise an exception, which AFAIK isn't caught. So in this case we should only be waiting 1 sec extra before getting an error. Right?

@pkch (Contributor, Author) commented Apr 25, 2017

Then the first replace() call will time out, which will raise an exception, which AFAIK isn't caught. So in this case we should only be waiting 1 sec extra before getting an error. Right?

Yes, precisely.

remove calls haven't been failing AFAIK

They aren't failing because the PermissionError is caught. Can you double check that this doesn't cause incorrect semantics (leaving bad files in place for use in later stages)?

@gvanrossum (Member)

Can you double check that this doesn't cause incorrect semantics (leaving bad files in place for use in later stages)?

Hm, let me see. (Literally just writing as I reason through this so you can check for yourself if there's a flaw in this argument.)

At this point the only thing we've done to the filesystem is the makedirs() call, which is idempotent. We've also computed the string to be written to the data file but haven't written it to the file yet. Now we're trying to compute the contents of the meta file and the crucial input, mtime+size of the source file (path) is unavailable due to a stat() error.

That file existed before (or we wouldn't have gotten this far). What we're doing here is mostly a slight optimization then -- cleaning up cache files for a source file that no longer exists. Presumably if the source file reappears it will have a different mtime so the cache files will be invalid anyway. Or if it is restored from a backup we're pessimizing things slightly. I think the reason this code exists at all is to clean the cache of irrelevant entries.

The try/except is meant to avoid needing other checks in case the cache files don't exist at all.

When could ignoring the error cause trouble? I guess the only interesting scenario is when there was a valid data/meta pair and somehow we delete the data but leave the meta.

On a subsequent run, assuming the file is restored from backup (otherwise we would rule the meta file invalid without ever looking at the data file), the is_meta_fresh() function would get to the point where it tries to call getmtime() on the data file:

    if os.path.getmtime(meta.data_json) != meta.data_mtime:

That stat() call would then fail because the data file doesn't exist, so the run would crash -- recovery from that would be deleting the cache (or just that particular meta file).

Now suppose we add error handling to the cache handling code -- then that getmtime() call would still fail but we'd presumably catch the error -- and then is_meta_fresh() should rule the meta file stale and return False, at which point we're still fine.
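
Schematically, the hardened check would look something like this; is_meta_fresh() and the meta fields are real names from the discussion above, but this exact error handling is only an illustration:

    import os
    from typing import Any

    def data_file_fresh(meta: Any) -> bool:
        # If the data file is gone (meta survived, data deleted), treat
        # the cache entry as stale instead of crashing on the stat() call.
        try:
            return os.path.getmtime(meta.data_json) == meta.data_mtime
        except OSError:
            return False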

Concluding, I don't think the code you flagged can cause incorrect semantics -- but thanks for asking, and it was a nice puzzle!

@pkch (Contributor, Author) commented Apr 25, 2017

I agree. I would summarize your argument by saying that before anyone relies on a cache file, they call is_meta_fresh, and that function can't be tricked by deleting / not deleting some file since it doesn't assume anything (it verifies everything from scratch).

I was also afraid that someone might call self.write_cache(), assume that since we just wrote it, the cache must now be fresh, and start relying on it without checking is_meta_fresh(). (This would be a bug because if os.remove failed by accident, we left a cache file that doesn't correspond to a real file.)

It never happens in the current version, and based on the code style I see, I don't see much risk it will happen in the future: is_meta_fresh() is a standard gateway to doing anything with cache. That said, I almost feel like calling is_meta_fresh from inside find_cache_meta, and making find_cache_meta return None if is_meta_fresh returns False. No big deal though.

@gvanrossum (Member)

I almost feel like calling is_meta_fresh from inside find_cache_meta

I believe there's a very good reason why these two are separated, having to do with the different phases of processing. API design is hard!

@pkch pkch changed the title Wait/retry when cannot delete file on Windows Reliable version of os.replace for Windows (wait/retry loop) Apr 25, 2017
@pkch (Contributor, Author) commented Apr 26, 2017

I am a bit concerned about my tests; they rely on pretty tight timing constraints (0.1 sec < 0.25 sec < 0.4 sec). Normally that's more than enough margin relative to random noise and disk I/O, but on a slow VM with multiple processes running in parallel and/or heavy disk activity they might fail intermittently.

At the same time, I don't want to delete the tests, or make them much longer (since these delays add onto the total test time).

I guess I'll use a separate pair of test src/dest files for each test, so that waiting for locks to expire after a test has already passed isn't done sequentially. That will give me more room to increase the durations of the locks and timeouts used in tests.

@gvanrossum (Member)

I am bit concerned about my tests

As you should be, given the (ironic) test failures...

pkch added 3 commits April 25, 2017 17:32
xdist isn't flexible enough to run a given set of tests in one process
If these tests are split into multiple processes, they will take a lot
longer since each will wait for locks to die out at the end.
mypy/util.py Outdated


def _replace(src: PathType, dest: PathType) -> None:
repl = cast(Callable[[], None], partial(os.replace, src, dest))
Member

Can't you use a lambda? Shouldn't even need the cast:

repl = lambda: os.replace(src, dest)

Contributor Author

Oh duh.. This code is no longer in the current version, but I'm glad I won't be casting partial as often in the future.

try:
import collections.abc as collections_abc
except ImportError:
import collections as collections_abc # type: ignore # PY32 and earlier
Member

But we don't support PY32 any more.

start_time = time.perf_counter()

def f() -> None:
if time.perf_counter() - start_time < lag:
Member

IMO this would be more readable as if time.perf_counter() < start_time + lag: -- then I can see immediately that it's taking this branch when called before lag time has passed.

class WaitRetryTests(TestCase):
def test_waitfor(self) -> None:
with self.assertRaises(OSError):
util.wait_for(create_funcs(), (PermissionError, FileExistsError), 0.1)
Member

IMO the exceptions should be in a list, not a tuple.

Contributor Author

Ah good to know, somehow I thought exception syntax was the same as isinstance, only tuples allowed :)
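
(The recollection is in fact right for except clauses, which accept a single class or a tuple of classes only; the list-vs-tuple suggestion applies just to wait_for's own parameter. A quick illustration:)

    try:
        raise PermissionError('file is locked')
    except (PermissionError, FileExistsError):    # a tuple works
        pass
    # "except [PermissionError, FileExistsError]:" would raise TypeError
    # at match time, since except accepts only a class or tuple of classes.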


class WaitRetryTests(TestCase):
def test_waitfor(self) -> None:
with self.assertRaises(OSError):
Member

I highly recommend adding a comment to each subtest (i.e. each util.wait_for() call, possibly wrapped in a context manager) explaining what it is for.



def lock_file(filename: str, duration: float) -> Thread:
'''
Member

Can you stick to the prevailing docstring style?

    """Opens filename (which must exist) for reading.

    After duration sec, release the handle.
    """

Member

Though honestly the first line should probably say something like

    """Open a file and keep it open in a background thread for a while."""


@skipUnless(WIN32, "only relevant for Windows")
class ReliableReplace(TestCase):
# will be cleaned up automatically when this class goes out of scope
Member

Actually that would depend on Python finalization order which is a nightmare.

Can't you make this more explicit? E.g. in tearDownClass() below.

Contributor Author

Oh, I didn't know that... scary, I always relied on it. Fixed.

mypy/util.py Outdated


if sys.version_info >= (3, 6):
PathType = Union[AnyStr, os.PathLike]
Member

os.PathLike is also generic in AnyStr. But by not mentioning that here you'll get it instantiated with Any instead.

Contributor Author

Fixed. I wonder if perhaps PathType should be available from the types stdlib module.

Member

I've been thinking of proposing to add it to typing.

Member

There is actually python/typing#402 that seems related.

Contributor Author

Hmm, actually it's not going to work at runtime because PathLike derives only from abc.ABC, and there's no equivalent thing exported from typing. I'll use AnyStr for now.

Member

@pkch Maybe a string literal 'PathLike[AnyStr]'?

Contributor Author

@ilevkivskyi it works! I thought strings could only help with forward declarations, but I guess they just tell the runtime not to worry about it, whatever the reason might be.

@ilevkivskyi (Member) Apr 26, 2017

Yes, it works until someone calls get_type_hints on this function. Generally this is considered a temporary workaround; for example, we prohibit things like 'collections.Counter[str]', but here I think it is OK.

Also note that when you write PathType = Union[AnyStr, 'os.PathLike[AnyStr]'], you create a generic type alias (since AnyStr is a type variable), so when you write just PathType, it will be translated to Union[Any, os.PathLike[Any]]. What you want is probably:

def _replace(src: PathType[AnyStr], dest: PathType[AnyStr], timeout: float = 10) -> None: ...
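
Assembled, the alias and a parameterized use site would read as below; quoting the annotations also keeps them from being evaluated at runtime (a sketch, not the final PR code):

    import os
    from typing import AnyStr, Union

    PathType = Union[AnyStr, 'os.PathLike[AnyStr]']

    def _replace(src: 'PathType[AnyStr]', dest: 'PathType[AnyStr]',
                 timeout: float = 10) -> None: ...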

mypy/util.py Outdated
PathType = AnyStr


def _replace(src: PathType, dest: PathType, timeout: float = 10) -> None:
Member

Because you don't parameterize the PathType types, they'll be using Any instead of AnyStr.

Contributor Author

I make this mistake 3 times a day. I guess I'll update #3141 (--warn-implicit-any) in case anyone else has this problem.

mypy/util.py Outdated
Increase wait time exponentially until total wait of timeout sec
On timeout, give up and reraise the last exception seen
'''
n_iter = max(1, math.ceil(math.log2(timeout / 0.001)))
Member

Can you bound the iteration count without using the math module? ISTM you can just use something like

wait = 0.001
while wait <= 0.5:
    <stuff>
    time.sleep(wait)
    wait = wait*2

Contributor Author

Yeah, I can, but with the no-math version the final iteration will have a wait time anywhere between 0.25 and 0.5 sec, so the total timeout is only precise up to a multiplicative factor of 2. That precision might be fine for our purposes, but it makes the tests really inefficient. What do you think?

Member

I don't understand why you can't generate exactly the same sequence of wait times using either algorithm?

Contributor Author

Well, I need to know what wait should start from. I'd like to start it from somewhere around 0.001 sec, but of course I don't care about the precise value. I do care about the total timeout though, which means I care about the final (rather than starting) value of the wait. Specifically, if my timeout is say 1 sec, I want my final wait just before I give up, to be 0.5 sec (that way the total wait will be 0.5 + 0.5/(2**1) + 0.5/(2**2) + ... = 1).

But in order to get the correct starting value without math, I'd need a second loop, where I repeat timeout /= 2 until it drops below (say) 0.001. I can do that, but is it better than calling math.log?

Actually, it might be easier to read, let me commit such a version, and let's decide which is better.
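
The no-math variant being committed looks roughly like this sketch; the giving-up condition is chosen so the sleeps sum to about timeout, per the reasoning above, and details may differ from the final code:

    import os
    import time

    def _replace(src: str, dest: str, timeout: float = 10) -> None:
        # Halve the timeout until it drops below ~1 ms; doubling back up
        # from there makes the sleeps sum to roughly timeout in total.
        wait = timeout
        while wait >= 0.001:
            wait /= 2
        while True:
            try:
                os.replace(src, dest)
                return
            except PermissionError:
                if wait > timeout / 2:
                    raise  # waited ~timeout in total; reraise last error
                time.sleep(wait)
                wait *= 2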

@gvanrossum (Member)

Do you need to sync typeshed as part of this PR? That's usually an antipattern.

@pkch (Contributor, Author) commented Apr 26, 2017

Hmm, AppVeyor refused to build due to an unmergeable branch until I committed typeshed.

I'm surprised too, this is only the second time it's happening since I've been working with mypy (the first time was over a month ago). Let me commit without typeshed, and we can figure out what's going on.

@gvanrossum (Member)

Oh, maybe it was due to the typeshed mess-up in a recent PR. It's all fixed now.

@gvanrossum (Member)

(The comment doesn't seem to have a "reply" link so responding here.)

Why is the exact duration of the timeout important? Timeouts are usually just guidelines. This seems perhaps just an artifact of the testing strategy?

@pkch (Contributor, Author) commented Apr 26, 2017

Yes, I only care about a precise timeout for testing purposes. What can we do, though? I don't want to get rid of the tests.

And I guess it doesn't hurt if users know that timeout=2 actually means 1.99-2.01 sec rather than anywhere from 2-4 sec; that's a pretty large gap.

mypy/util.py Outdated
if wait < 0.001:
break
wait /= 2
while True:
@pkch (Contributor, Author) Apr 26, 2017

@gvanrossum Moving the discussion here.

I could just make the time allowance in the test much larger. That way it would pass even though this function guarantees the timeout only up to a factor of 2. But the test is already slow (it adds 3-4 sec to the entire test suite) due to the file locks.

I was trying to make this test not affect total test runtime by adding another worker, since it mostly just sleeps. But that hasn't worked yet (also, I fear an extra worker might slow tests down a bit due to more process switching).

Member

Let's just keep it simple rather than trying to work against xdist. If it's hard to figure out now I'm sure it will be hard to maintain in the future.

Just in case, my go-to xdist wizard is @gnprice .

@gvanrossum (Member)

Thanks, indeed "files changed" no longer lists typeshed. Yay!

short_lock = 0.25
long_lock = 2

threads = [] # type: List[Thread]
Member

What would happen if you made tmpdir and threads just instance variables, resetting and cleaning them for each test case rather than once for all test cases in the class? You have only one test_* method anyways, so...???

Member

I'd still like you to ponder this, but it's optional at this point.

Member

(FWIW that was actually an old comment that GitHub had somehow hung on to. As were several other comments on code you had already ripped out... Sorry about the noise!)

@gvanrossum (Member) left a comment

Getting there! I'm going to be super careful here since it only affects Windows and I don't want to destabilize things again a week before the 0.510 release.

mypy/util.py Outdated
@@ -2,8 +2,16 @@

import re
import subprocess
import os
import sys
import math
Member

You no longer need this import, nor itertools.count below.


def prepare_src_dest(self, src_lock_duration: float, dest_lock_duration: float
) -> Tuple[str, str]:
# Create two temporary files with random names.
Member

Make this into a docstring and add that this also spawns threads that lock each of them for the specified duration. Also mention the content of each is unique.

short_lock = timeout / 4
long_lock = timeout * 2

threads = [] # type: List[Thread]
Member

This should become an instance variable.

Contributor Author

My idea was that the entire file-locking arrangement is shared across all the test* methods of the WindowsReplace test case. Specifically, I want to wait for all the file locks to expire at the end, rather than wait for the file lock in each test to expire. The former is much more efficient: while the other tests are running, old file locks naturally use up part (or even all) of their duration, so the wait at the end is much shorter. If we adopt this approach, I need threads to be class-level, since WindowsReplace is instantiated as many times as there are test* methods in it. It also means I must not delete the tmpdir until the end of the entire test case, i.e., in tearDownClass rather than instance-level tearDown. Is there a disadvantage to this approach?

Note: I temporarily reduced the number of test* methods to 1 due to issues with xdist sending different tests to different workers (and thus defeating the entire mechanism I described in the previous paragraph). But this is a horrible arrangement; once I figure out how to resolve the xdist issue, I definitely want to follow the standard practice of keeping each test in its own method.

Contributor Author

I went back to having 4 tests, since otherwise it's too messy. There's no semantic problem with that, and I'll deal with the extra few seconds in test runtime later.

But with this, I'd like to keep the class-level attributes if you don't see an issue with them; both semantically and performance-wise, I don't want to force each individual test to wait for the threads to expire.
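
The arrangement being defended, in skeleton form (the actual test class has more setup than this sketch):

    import tempfile
    from threading import Thread
    from typing import List
    from unittest import TestCase

    class WindowsReplace(TestCase):
        # Shared across all test* methods: lock threads started by early
        # tests run down their durations while later tests execute.
        tmpdir = tempfile.TemporaryDirectory()
        threads = []  # type: List[Thread]

        @classmethod
        def tearDownClass(cls) -> None:
            # Join every lock thread once, at the very end, then clean up
            # explicitly instead of relying on finalization order.
            for t in cls.threads:
                t.join()
            cls.tmpdir.cleanup()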

threads = [] # type: List[Thread]

@classmethod
def tearDownClass(cls) -> None:
Member

Please switch to (instance-level) tearDown.

self.threads.append(lock_file(dest, dest_lock_duration))
return src, dest

def replace_ok(self, src_lock_duration: float, dest_lock_duration: float,
Member

Add a docstring for this.

mypy/util.py Outdated
if wait > timeout:
raise
else:
return
Member

You can move this return into the try block, then you don't need the else block.

@@ -44,7 +44,11 @@ def tearDownClass(cls) -> None:

def prepare_src_dest(self, src_lock_duration: float, dest_lock_duration: float
) -> Tuple[str, str]:
# Create two temporary files with random names.
'''Create two files in self.tmpdir random names (src, dest) and unique contents;
Member

Can you use """ for docstrings please?

Contributor Author

Oops, one day I'll learn to follow patterns.


import tempfile
from contextlib import contextmanager
from threading import Thread
from unittest import TestCase, main, skipUnless
Contributor

Wow, turns out mypy.test, unittest, and pytest are currently being used...ouch...

Member

Yeah, but Max didn't start that. IIUC we want to kill mypy.test, but it's low priority (not sure if there's even an issue) and pytest works well enough with unittest (also, personally, I'm more used to the unittest way of writing tests -- let pytest just be the test runner).

@pkch (Contributor, Author) commented Apr 29, 2017

This PR is limited to os.replace; it does not fix os.remove / os.unlink.

We don't rely on os.unlink in production, but AppVeyor [Edit: no, just local] test failures occasionally occur due to the use of tempfile.TemporaryDirectory in myunit.TestCase. Specifically, when it's cleaned up in myunit.TestCase.tear_down(), the regular shutil.rmtree can fail to delete files on Windows:

FAILURE  #15 run eval-test-A

Traceback (most recent call last):
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__main__.py", line 18, in <module>
        main()
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__init__.py", line 231, in main
        num_total, num_fail, num_skip = run_test_recursive(t, 0, 0, 0, '', 0)
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__init__.py", line 280, in run_test_recursive
        stest, num_total, num_fail, num_skip, new_prefix, depth + 1)
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__init__.py", line 280, in run_test_recursive
        stest, num_total, num_fail, num_skip, new_prefix, depth + 1)
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__init__.py", line 261, in run_test_recursive
        is_fail, is_skip = run_single_test(name, test)
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__init__.py", line 298, in run_single_test
        test.tear_down()  # FIX: check exceptions
    File "C:\Users\pkch\Downloads\mypy\mypy\test\data.py", line 256, in tear_down
        super().tear_down()
    File "C:\Users\pkch\Downloads\mypy\mypy\myunit\__init__.py", line 134, in tear_down
        self.tmpdir.cleanup()
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\tempfile.py", line 811, in cleanup
        _shutil.rmtree(self.name)
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\shutil.py", line 488, in rmtree
        return _rmtree_unsafe(path, onerror)
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\shutil.py", line 378, in _rmtree_unsafe
        _rmtree_unsafe(fullname, onerror)
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\shutil.py", line 378, in _rmtree_unsafe
        _rmtree_unsafe(fullname, onerror)
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\shutil.py", line 378, in _rmtree_unsafe
        _rmtree_unsafe(fullname, onerror)
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\shutil.py", line 383, in _rmtree_unsafe
        onerror(os.unlink, fullname, sys.exc_info())
    File "C:\Users\pkch\AppData\Local\Programs\Python\Python36\lib\shutil.py", line 381, in _rmtree_unsafe
        os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\pkch\\Downloads\\mypy\\tmp-test-dirs\\mypy-test-gue5k2nx\\tmp\\.mypy_cache\\3.6\\types.data.json'

This PR could also fix os.remove / os.unlink if desired, since that can be done with the same retry loop as os.replace. Unfortunately, it means we couldn't keep using tempfile.TemporaryDirectory (that would require monkey-patching the shutil.rmtree call hard-coded inside it). But we could simply replace tempfile.TemporaryDirectory with our own directory creation/destruction functions. Should I add that functionality to this PR, or make that fix a separate PR?

(I also submitted a suggestion to fix TemporaryDirectory itself, but that's not going to be easy since it affects infinitely many use cases and it takes a while to collect feedback about all of them.)

@gvanrossum (Member) commented Apr 29, 2017

Have you heard the story of the Hydra? I'm beginning to despair that this PR will ever be complete. Please just focus on os.replace(), and if the other problem becomes too annoying we'll deal with it later -- but even then I suspect for a pure cleanup the suggestion I made for TemporaryDirectory (just catch and ignore the exception) is good enough, so we won't need to revisit the retry logic there.

@pkch (Contributor, Author) commented Apr 29, 2017

No worries, we got it! This PR is now 100% focused on the replace.

I ran the tests of util.replace many times and discovered intermittent failures. It turned out there were no issues with util.replace itself, but there were two problems in the test code:

  • I introduced a race condition: sometimes the thread that locks the file hadn't kicked in by the time util.replace was called, so of course replace succeeded even though the test didn't expect it to.
  • Sometimes the test tried to lock the file (by opening it) within microseconds of its being closed; even though this happened synchronously (so it should be OK in theory), Windows didn't like it and occasionally raised PermissionError.

My fix to both problems simplifies the code a lot: instead of asking a thread to lock the file, I keep the file open after writing to it and ask a thread to close it later. That way it's perfectly obvious what's going on, and there's no unnecessary reopening of files.
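
A sketch of that simplification; close_later is a hypothetical name, but the shape matches the description: the caller opens and writes the file itself, so the lock is guaranteed to be held before any test code runs:

    import time
    from threading import Thread
    from typing import IO

    def close_later(f: IO[str], duration: float) -> Thread:
        # The caller opened (and wrote) the file and left it open, so
        # there is no race; the thread only releases the handle later.
        def release() -> None:
            time.sleep(duration)
            f.close()
        t = Thread(target=release)
        t.start()
        return t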

I also split the tests into several smaller parts because they were too hard to debug otherwise. It works fine as is, but I do have a fix for xdist being a bit inefficient; I'll put it in the separate PR I'm preparing for speeding up tests.

I reran the tests again many times, and found no more failures (apart from the intermittent inability to delete folders or files on Windows, which we'll deal with separately).

@gvanrossum (Member)

So I'm getting more and more worried about the complexity of this endeavour. The test failures you fixed do nothing to increase my confidence. (The fact that they weren't obvious in review is a warning sign.)

What would be lost if we simply caught the original error in write_cache() and gave up (maybe with a warning message)? The next run will re-create the cache file if necessary, and the tests will pass. (We also recently merged #3255 which disables cache writing in most tests.)
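
Schematically, that give-up approach is just this (names here are illustrative, not the actual mypy code):

    import os

    def replace_or_warn(tmp: str, target: str) -> bool:
        # Best-effort cache write: on failure, warn and move on; the next
        # run simply redoes the work and re-creates the cache file.
        try:
            os.replace(tmp, target)
            return True
        except OSError as err:
            print('Could not replace {}: {}'.format(target, err))
            return False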

@pkch (Contributor, Author) commented Apr 30, 2017

We can do that, but with a large cache, failures would be pretty common. Still, it's an option.

That said, writing the tests in this case is much harder than writing the actual functionality, because we have to artificially and reliably recreate the concurrent behavior. There was never really a concern about the correctness of the code itself. Maybe we should just allow the (now fixed and simplified) tests to be of slightly higher complexity than the rest of the code in this PR?

Either way is fine with me, I don't have a large code base to comment on how annoying the caching issue is.

"""
src, dest = self.prepare_src_dest(src_lock_duration, dest_lock_duration)
util._replace(src, dest, timeout=timeout)
self.assertEqual(open(dest).read(), src, 'replace failed')
Member

Should use with open...; this code will throw a ResourceWarning on recent versions of Python.

self.replace_ok(0, 0, self.timeout)

def test_original_problem(self) -> None:
# Make sure we can reproduce the issue with our setup.
Member

There doesn't appear to be any comment to tell us what "the issue" is. Maybe add a reference to the original GitHub issue?

gvanrossum pushed a commit that referenced this pull request Apr 30, 2017
This is an alternative attempt at fixing issue #3215, given the
complexity of PR #3239.
@gvanrossum (Member)

The cache issue so far has only been annoying for the AppVeyor tests.

I've got a simpler solution that catches write and replace errors for both JSON files. I think that because of the way I designed the code in the first place this is not going to create incorrectness -- the worst that can happen is that the next run will need to do extra work.

See #3288

@gvanrossum (Member)

BTW if we go with my solution, and later find the replace() issue still causing problems, we can revisit your version. But for now, my spider-sense is tingling.

@pkch (Contributor, Author) commented Apr 30, 2017

Sure, if the problem from aborting the cache write is just a minor performance issue, we don't have to bring in the wait/retry loop just for Windows.

I'll apply the CR comments from @JelleZijlstra, then close this PR and move this code into a separate repo on my GitHub account, in case we or anyone else want to use it later.

@pkch pkch closed this Apr 30, 2017
gvanrossum added a commit that referenced this pull request May 2, 2017
This is an alternative attempt at fixing issue #3215, given the
complexity of PR #3239.