Reliable version of os.replace for Windows (wait/retry loop) #3239
Conversation
Honestly, the wait_for() function seems pretty over-engineered. Its only known call site (outside tests) has only one function and one exception. I think we can make it simpler. I'm still pondering Jukka's comment in the issue, but TBH I'm not sure I follow the scenarios. How likely is it that we're in a scenario where we end up waiting a second for each cache file we write? That would be pretty horrible, but does it really happen? After all, the issue only noticed occasional AppVeyor failures. I have some other nits, but I think we should decide on the high-level approach first. Should we really change the behavior of write_cache() (which was pretty carefully designed when it comes to avoiding corrupt files no matter at what point the process is killed hard, assuming reasonable filesystem semantics), or is just retrying the replace() until it succeeds enough?
Sure, but I thought we may also want to use it for
User misconfiguration. For example, cache folder has write permission so new temp files can be created, but the existing cache files are owned by another user so they can never be replaced.
Yes, but you said we wanted to solve it not just for
Let's have a simple wait function now -- we can refactor it when we need it for the remove calls (those haven't been failing AFAIK).
Then the first replace() call will time out, which will raise an exception, which AFAIK isn't caught. So in this case we should only be waiting 1 sec extra before getting an error. Right?
Yes precisely.
They are not, because the
Hm, let me see. (Literally just writing as I reason through this, so you can check for yourself if there's a flaw in this argument.) At this point the only thing we've done to the filesystem is the makedirs() call, which is idempotent. We've also computed the string to be written to the data file but haven't written it to the file yet. Now we're trying to compute the contents of the meta file, and the crucial input, mtime+size of the source file (path), is unavailable due to a stat() error. That file existed before (or we wouldn't have gotten this far).

What we're doing here is mostly a slight optimization then -- cleaning up cache files for a source file that no longer exists. Presumably if the source file reappears it will have a different mtime, so the cache files will be invalid anyway. Or if it is restored from a backup we're pessimizing things slightly. I think the reason this code exists at all is to clean the cache of irrelevant entries. The try/except is meant to avoid needing other checks in case the cache files don't exist at all.

When could ignoring the error cause trouble? I guess the only interesting scenario is when there was a valid data/meta pair and somehow we delete the data but leave the meta. On a subsequent run, assuming the file is restored from backup (otherwise we would rule the meta file invalid without ever looking at the data file), the is_meta_fresh() function would get to the point where it tries to call getmtime() on the data file: Line 818 in 65b9b0b

Now suppose we add error handling to the cache handling code -- then that getmtime() call would still fail, but we'd presumably catch the error -- and then is_meta_fresh() should rule the meta file stale and return False, at which point we're still fine. Concluding, I don't think the code you flagged can cause incorrect semantics -- but thanks for asking, and it was a nice puzzle!
I agree. I would summarize your argument by saying that before anyone relies on a cache file, they call

I was also afraid that someone might call

It never happens in the current version, and based on the code style I see, I don't see much risk it will happen in the future:
I believe there's a very good reason why these two are separated, having to do with the different phases of processing. API design is hard!
I am a bit concerned about my tests; they rely on pretty tight timing constraints (0.1 sec < 0.25 sec < 0.4 sec). While normally this is more than enough of a difference relative to random noise and disk I/O, on a slow VM with multiple processes running in parallel and/or heavy disk activity it might actually fail intermittently. At the same time, I don't want to delete the tests, or make them much longer (since these delays add onto the total test time). I guess I'll use a separate pair of test src / dest files for each test, so that the waiting time for locks to expire after the test already passed isn't done sequentially. This will give me more room to increase the duration of locks and timeouts used for tests.
As you should be, given the (ironic) test failures... |
xdist isn't flexible enough to run a given set of tests in one process. If these tests are split into multiple processes, they will take a lot longer, since each will wait for locks to die out at the end.
mypy/util.py (Outdated)

```python
def _replace(src: PathType, dest: PathType) -> None:
    repl = cast(Callable[[], None], partial(os.replace, src, dest))
```
Can't you use a lambda? Shouldn't even need the cast:
```python
repl = lambda: os.replace(src, dest)
```
Oh duh... This code is no longer in the current version, but I'm glad I won't be casting `partial` as often in the future.
mypy/test/testutil.py (Outdated)

```python
try:
    import collections.abc as collections_abc
except ImportError:
    import collections as collections_abc  # type: ignore  # PY32 and earlier
```
But we don't support PY32 any more.
mypy/test/testutil.py (Outdated)

```python
start_time = time.perf_counter()


def f() -> None:
    if time.perf_counter() - start_time < lag:
```
IMO this would be more readable as `if time.perf_counter() < start_time + lag:` -- then I can see immediately that it's taking this branch when called before `lag` time has passed.
mypy/test/testutil.py (Outdated)

```python
class WaitRetryTests(TestCase):
    def test_waitfor(self) -> None:
        with self.assertRaises(OSError):
            util.wait_for(create_funcs(), (PermissionError, FileExistsError), 0.1)
```
IMO the exceptions should be in a list, not a tuple.
Ah, good to know. Somehow I thought exception syntax was the same as `isinstance`, only tuples allowed :)
mypy/test/testutil.py (Outdated)

```python
class WaitRetryTests(TestCase):
    def test_waitfor(self) -> None:
        with self.assertRaises(OSError):
```
I highly recommend adding a comment to each subtest (i.e. each util.wait_for() call, possibly wrapped in a context manager) explaining what it is for.
mypy/test/testutil.py (Outdated)

```python
def lock_file(filename: str, duration: float) -> Thread:
    '''
```
Can you stick to the prevailing docstring style?
```python
"""Opens filename (which must exist) for reading.

After duration sec, release the handle.
"""
```
Though honestly the first line should probably say something like `"""Open a file and keep it open in a background thread for a while."""`
mypy/test/testutil.py (Outdated)

```python
@skipUnless(WIN32, "only relevant for Windows")
class ReliableReplace(TestCase):
    # will be cleaned up automatically when this class goes out of scope
```
Actually that would depend on Python finalization order which is a nightmare.
Can't you make this more explicit? E.g. in tearDownClass() below.
Oh, I didn't know that... scary, I always relied on that. Fixed.
mypy/util.py (Outdated)

```python
if sys.version_info >= (3, 6):
    PathType = Union[AnyStr, os.PathLike]
```
os.PathLike is also generic in AnyStr. But by not mentioning that here you'll get it instantiated with Any instead.
Fixed. I wonder if perhaps PathType should be available from the `types` stdlib module.
I've been thinking of proposing to add it to `typing`.
There is actually python/typing#402 that seems related.
Hmm, actually it's not going to work at runtime because `PathLike` derives only from `abc.ABC`, and there's no equivalent thing exported from `typing`. I'll use `AnyStr` for now.
@pkch Maybe a string literal `'PathLike[AnyStr]'`?
@ilevkivskyi it works! I thought strings can only help with forward declarations, but I guess they just tell runtime not to worry about it whatever the cause might be.
Yes, it works until someone calls `get_type_hints` on this function. Generally, this is considered a temporary workaround; for example, we prohibit things like `'collections.Counter[str]'`, but here I think it is OK.

Also note that when you write `PathType = Union[AnyStr, 'os.PathLike[AnyStr]']`, you create a generic type alias (since `AnyStr` is a type variable), so that when you write just `PathType`, it will be translated to `Union[Any, os.PathLike[Any]]`. What you want is probably:

```python
def _replace(src: PathType[AnyStr], dest: PathType[AnyStr], timeout: float = 10) -> None: ...
```
mypy/util.py (Outdated)

```python
PathType = AnyStr


def _replace(src: PathType, dest: PathType, timeout: float = 10) -> None:
```
Because you don't parameterize the PathType types, they'll be using Any instead of AnyStr.
I make this mistake 3 times a day. I guess I'll update #3141 (`--warn-implicit-any`) in case anyone else has this problem.
mypy/util.py (Outdated)

```python
    Increase wait time exponentially until total wait of timeout sec
    On timeout, give up and reraise the last exception seen
    '''
    n_iter = max(1, math.ceil(math.log2(timeout / 0.001)))
```
Can you bound the iteration count without using the math module? ISTM you can just use something like:

```python
wait = 0.001
while wait <= 0.5:
    <stuff>
    time.sleep(wait)
    wait = wait * 2
```
Yeah I can, but if I use the no-math version, the final iteration will have wait time anywhere between 0.25 and 0.5 sec, so the total timeout is only precise up to a multiplicative factor of 2. That precision might be fine for our purposes, but the tests become really inefficient. What do you think?
I don't understand why you can't generate exactly the same sequence of wait times using either algorithm?
Well, I need to know what `wait` should start from. I'd like to start it from somewhere around 0.001 sec, but of course I don't care about the precise value. I do care about the total timeout though, which means I care about the final (rather than starting) value of the wait. Specifically, if my timeout is say 1 sec, I want my final `wait`, just before I give up, to be 0.5 sec (that way the total wait will be `0.5 + 0.5/(2**1) + 0.5/(2**2) + ... = 1`).

But in order to get the correct starting value without math, I'd need to create a second loop, where I repeat `timeout /= 2` until it gets to (say) below 0.001. I can do that, but is it better than calling `math.log`?

Actually, it might be easier to read; let me commit such a version, and let's decide which is better.
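The arithmetic of the two-loop, math-free variant can be checked in isolation with a hypothetical helper that just returns the wait sequence instead of sleeping:

```python
def backoff_waits(timeout: float, start_cap: float = 0.001) -> list:
    """Exponentially increasing wait times whose total is roughly `timeout`.

    Mirrors the two-loop idea: first halve `timeout` down to the starting
    wait, then double on each retry.  (Hypothetical helper, factored out
    of the retry loop so the arithmetic is easy to verify.)
    """
    # Loop 1: find the starting wait by repeated halving (no math.log needed).
    wait = timeout
    while wait >= start_cap:
        wait /= 2
    # Loop 2: accumulate doubling waits while they still fit in the budget.
    waits = []
    total = 0.0
    while total + wait <= timeout:
        waits.append(wait)
        total += wait
        wait *= 2
    return waits
```

For `timeout=1` this yields waits from 1/1024 up to a final 0.5 sec, summing to 1023/1024 — i.e. the total honors the timeout to within one starting-wait, which is the precision the tests rely on.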
Do you need to sync typeshed as part of this PR? That's usually an antipattern.

Hmm, AppVeyor refused to build due to an unmergeable branch until I committed typeshed. I'm surprised too; this is only the second time it's happening since I've been working with mypy (the first time was over a month ago). Let me commit without typeshed, and we can figure out what's going on.

Oh, maybe it was due to the typeshed mess-up in a recent PR. It's all fixed now.

(The comment doesn't seem to have a "reply" link so responding here.) Why is the exact duration of the timeout important? Timeouts are usually just guidelines. This seems perhaps just an artifact of the testing strategy?

Yes, I only care about the precise timeout for testing purposes. What can we do though? I don't want to get rid of the tests. And I guess it doesn't hurt if the users know that
mypy/util.py (Outdated)

```python
    if wait < 0.001:
        break
    wait /= 2
while True:
```
@gvanrossum Moving the discussion here.
I could just make the time allowance in the test much larger. That way it will pass even though this function guarantees `timeout` only up to a factor of 2x. But the test is already slow (adds 3-4 sec to the entire test suite) due to the file locks.
I was trying to make this test not affect total test runtime by adding another worker, since it's just sleeping for a while. But it didn't work yet (also, I fear having an extra worker might slow down tests a bit due to more process switching).
Let's just keep it simple rather than trying to work against xdist. If it's hard to figure out now I'm sure it will be hard to maintain in the future.
Just in case, my go-to xdist wizard is @gnprice .
Thanks, indeed "files changed" no longer lists typeshed. Yay!
mypy/test/testutil.py (Outdated)

```python
short_lock = 0.25
long_lock = 2

threads = []  # type: List[Thread]
```
What would happen if you made tmpdir and threads just instance variables, resetting and cleaning them for each test case rather than once for all test cases in the class? You have only one test_* method anyways, so...???
I'd still like you to ponder this, but it's optional at this point.
(FWIW that was actually an old comment that GitHub had somehow hung on to. As were several other comments on code you had already ripped out... Sorry about the noise!)
Getting there! I'm going to be super careful here since it only affects Windows and I don't want to destabilize things again a week before the 0.510 release.
mypy/util.py (Outdated)

```diff
@@ -2,8 +2,16 @@
 import re
 import subprocess
 import os
 import sys
 import math
```
You no longer need this import, nor `itertools.count` below.
mypy/test/testutil.py (Outdated)

```python
def prepare_src_dest(self, src_lock_duration: float, dest_lock_duration: float
                     ) -> Tuple[str, str]:
    # Create two temporary files with random names.
```
Make this into a docstring and add that this also spawns threads that lock each of them for the specified duration. Also mention the content of each is unique.
```python
short_lock = timeout / 4
long_lock = timeout * 2

threads = []  # type: List[Thread]
```
This should become an instance variable.
My idea was that the entire file locking arrangement is shared across all the `test*` methods of the `WindowsReplace` test case. Specifically, I want to wait for all the file locks to expire at the end, rather than wait for the file lock in each test to expire. The former is much more efficient: while the other tests are running, old file locks will naturally use up part (or even all) of their duration, so the wait at the end will be much shorter. If we adopt this approach, I need `threads` to be class-level, since `WindowsReplace` is instantiated as many times as there are `test*` methods in it. It also means I must not delete the `tmpdir` until the end of the entire test case, i.e., until `tearDownClass` rather than instance-level `tearDown`. Is there a disadvantage to this approach?
Note: I temporarily reduced the number of `test*` methods to 1 due to issues with `xdist` sending different tests to different workers (and thus defeating the entire mechanism I described in the previous paragraph). But this is a horrible arrangement; once I figure out how to resolve the `xdist` issue, I definitely want to follow the standard practice of keeping each test in its own method.
I went back to having 4 tests, since otherwise it's too messy. There's no semantic problem with that, and I'll deal with the extra few seconds in test runtime later.
But with this, I'd like to keep the class-level attributes if you don't see an issue with them; both semantically and performance-wise, I don't want to force each individual test to wait for the threads to expire.
mypy/test/testutil.py (Outdated)

```python
threads = []  # type: List[Thread]


@classmethod
def tearDownClass(cls) -> None:
```
Please switch to (instance-level) tearDown.
mypy/test/testutil.py (Outdated)

```python
        self.threads.append(lock_file(dest, dest_lock_duration))
        return src, dest

    def replace_ok(self, src_lock_duration: float, dest_lock_duration: float,
```
Add a docstring for this.
mypy/util.py (Outdated)

```python
if wait > timeout:
    raise
else:
    return
```
You can move this `return` into the `try` block; then you don't need the `else` block.
mypy/test/testutil.py (Outdated)

```python
def prepare_src_dest(self, src_lock_duration: float, dest_lock_duration: float
                     ) -> Tuple[str, str]:
    '''Create two files in self.tmpdir random names (src, dest) and unique contents;
```
Can you use """ for docstrings please?
Oops, one day I'll learn to follow patterns.
```python
import tempfile
from contextlib import contextmanager
from threading import Thread
from unittest import TestCase, main, skipUnless
```
Wow, turns out `mypy.test`, `unittest`, and pytest are currently being used... ouch...
Yeah, but Max didn't start that. IIUC we want to kill mypy.test, but it's low priority (not sure if there's even an issue) and pytest works well enough with unittest (also, personally, I'm more used to the unittest way of writing tests -- let pytest just be the test runner).
This PR is only limited to We don't rely on
This PR can fix (I also submitted a suggestion to fix

Have you heard the story of the Hydra? I'm beginning to despair that this PR will ever be complete. Please just focus on os.replace(), and if the other problem becomes too annoying we'll deal with it later -- but even then I suspect for a pure cleanup the suggestion I made for TemporaryDirectory (just catch and ignore the exception) is good enough, so we won't need to revisit the retry logic there.
No worries, we got it! This PR is now 100% focused on the I ran the tests of
My fix to both problems simplifies the code a lot: instead of asking a thread to lock the file, I'm instead keeping the file open after I wrote to it, and asking a thread to close it later. That way, it's perfectly obvious what's going on, and there's also no unnecessary reopening of files. I also split the tests into several smaller parts because it was too hard to debug without it. It works fine as is, but I do have a fix to I reran the tests again many times, and found no more failures (apart from the intermittent inability to delete folders or files on Windows, which we'll deal with separately).
So I'm getting more and more worried about the complexity of this endeavour. The test failures you fixed do nothing to increase my confidence. (The fact that they weren't obvious in review is a warning sign.) What would be lost if we simply caught the original error in write_cache() and gave up (maybe with a warning message)? The next run will re-create the cache file if necessary, and the tests will pass. (We also recently merged #3255 which disables cache writing in most tests.)

We can do that, but with a large cache failures would be pretty common. Still, it's an option. That said, writing tests in this case is much harder than writing the actual functionality, because we have to artificially and reliably recreate the behavior with concurrency. There was never really a concern with the correctness of the code itself. Maybe we should just allow the (now fixed and simplified) tests to be of slightly higher complexity than the rest of the code in this PR? Either way is fine with me; I don't have a large code base to comment on how annoying the caching issue is.
mypy/test/testutil.py (Outdated)

```python
        """
        src, dest = self.prepare_src_dest(src_lock_duration, dest_lock_duration)
        util._replace(src, dest, timeout=timeout)
        self.assertEqual(open(dest).read(), src, 'replace failed')
```
Should use `with open(...)`; this code will throw a ResourceWarning on recent versions of Python.
mypy/test/testutil.py (Outdated)

```python
        self.replace_ok(0, 0, self.timeout)

    def test_original_problem(self) -> None:
        # Make sure we can reproduce the issue with our setup.
```
There doesn't appear to be any comment to tell us what "the issue" is. Maybe add a reference to the original GitHub issue?
The cache issue so far has only been annoying for the AppVeyor tests. I've got a simpler solution that catches write and replace errors for both JSON files. I think that because of the way I designed the code in the first place this is not going to create incorrectness -- the worst that can happen is that the next run will need to do extra work. See #3288
BTW if we go with my solution, and later find the replace() issue still causing problems, we can revisit your version. But for now, my spider-sense is tingling.
Sure, if the problem from aborting the writing of the cache is just a minor performance issue, we don't have to bring in the wait/retry loop just for Windows. I'll apply CR comments from @JelleZijlstra, then close this PR, and move this code into a separate repo on my GitHub account, just in case we or anyone else wants to use it later.
Fix #3215
Copying my comment from there:
For now I used (3), but made the `wait_for` API support refactoring into (1) if it's desired.