Skip to content

gh-117841: Add C implementation of ntpath.lexists #117842

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

nineteendo
Copy link
Contributor

@nineteendo nineteendo commented Apr 13, 2024

Benchmark

script
::speedup-ntpath.lexists.bat
@echo off
echo existent && call main\python -m timeit -s "import ntpath" "ntpath.lexists('main')" && call speedup-ntpath.lexists\python -m timeit -s "import ntpath" "ntpath.lexists('main')"
echo non-existent && call main\python -m timeit -s "import ntpath" "ntpath.lexists('foo')" && call speedup-ntpath.lexists\python -m timeit -s "import ntpath" "ntpath.lexists('foo')"
existent
5000 loops, best of 5: 46.5 usec per loop # before
10000 loops, best of 5: 33.2 usec per loop # after
# -> 1.40x faster
non-existent
10000 loops, best of 5: 22.2 usec per loop # before
20000 loops, best of 5: 15.6 usec per loop # after
# -> 1.42x faster

@nineteendo nineteendo changed the title Speedup ntpath.lexists gh-117841: Speedup ntpath.lexists Apr 13, 2024
@nineteendo nineteendo marked this pull request as ready for review April 13, 2024 11:31
barneygale added a commit to barneygale/cpython that referenced this pull request Apr 13, 2024
…lexists()`

Use `os.path.lexists()` rather than `os.lstat()` to test whether paths
exist. This is equivalent on POSIX, but faster on Windows.
@nineteendo nineteendo changed the title gh-117841: Speedup ntpath.lexists gh-117841: Add C implementation of ntpath.lexists Apr 13, 2024
@eryksun
Copy link
Contributor

eryksun commented Apr 14, 2024

I'm working on what I hope will be an improved version compared to the first draft. AFAIK, the GetFileInformationByName() fast path will be generally available when Windows 11 24H2 is released later this year. Until then, and for older systems, we need to focus on improving the 'slow' path. I'm looking to avoid full STAT and LSTAT calls, except as a last resort.

@nineteendo nineteendo marked this pull request as draft April 14, 2024 21:48
@eryksun
Copy link
Contributor

eryksun commented Apr 14, 2024

Here's the revised implementation of nt_exists():

static PyObject *
nt_exists(PyObject *path, int follow_symlinks)
{
    path_t _path = PATH_T_INITIALIZE("exists", "path", 0, 1);
    HANDLE hfile;
    BOOL traverse = follow_symlinks;
    int result = 0;

    if (!path_converter(path, &_path)) {
        path_cleanup(&_path);
        if (PyErr_ExceptionMatches(PyExc_ValueError)) {
            PyErr_Clear();
            Py_RETURN_FALSE;
        }
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    if (_path.fd != -1) {
        hfile = _Py_get_osfhandle_noraise(_path.fd);
        if (hfile != INVALID_HANDLE_VALUE) {
            result = 1;
        }
    }
    else if (_path.wide) {
        BOOL slow_path = TRUE;
        FILE_STAT_BASIC_INFORMATION statInfo;
        if (_Py_GetFileInformationByName(_path.wide, FileStatBasicByNameInfo,
                &statInfo, sizeof(statInfo)))
        {
            if (!(statInfo.FileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) ||
                !follow_symlinks &&
                IsReparseTagNameSurrogate(statInfo.ReparseTag))
            {
                slow_path = FALSE;
                result = 1;
            }
            else {
                // reparse point but not name-surrogate
                traverse = TRUE;
            }
        }
        else if (_Py_GetFileInformationByName_ErrorIsTrustworthy(
                        GetLastError()))
        {
            slow_path = FALSE;
        }
        if (slow_path) {
            BOOL traverse = follow_symlinks;
            if (!traverse) {
                hfile = CreateFileW(_path.wide, FILE_READ_ATTRIBUTES, 0, NULL,
                            OPEN_EXISTING, FILE_FLAG_OPEN_REPARSE_POINT |
                            FILE_FLAG_BACKUP_SEMANTICS, NULL);
                if (hfile != INVALID_HANDLE_VALUE) {
                    FILE_ATTRIBUTE_TAG_INFO info;
                    if (GetFileInformationByHandleEx(hfile,
                            FileAttributeTagInfo, &info, sizeof(info)))
                    {
                        if (!(info.FileAttributes &
                                FILE_ATTRIBUTE_REPARSE_POINT) ||
                            IsReparseTagNameSurrogate(info.ReparseTag))
                        {
                            result = 1;
                        }
                        else {
                            // reparse point but not name-surrogate
                            traverse = TRUE;
                        }
                    }
                    else {
                        // device or legacy filesystem
                        result = 1;
                    }
                    CloseHandle(hfile);
                }
                else {
                    STRUCT_STAT st; 
                    switch (GetLastError()) {
                    case ERROR_ACCESS_DENIED:
                    case ERROR_SHARING_VIOLATION:
                    case ERROR_CANT_ACCESS_FILE:
                    case ERROR_INVALID_PARAMETER:
                        if (!LSTAT(_path.wide, &st)) {
                            result = 1;
                        }
                    }
                }
            }
            if (traverse) {
                hfile = CreateFileW(_path.wide, FILE_READ_ATTRIBUTES, 0, NULL,
                            OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
                if (hfile != INVALID_HANDLE_VALUE) {
                    CloseHandle(hfile);
                    result = 1;
                }
                else {
                    STRUCT_STAT st; 
                    switch (GetLastError()) {
                    case ERROR_ACCESS_DENIED:
                    case ERROR_SHARING_VIOLATION:
                    case ERROR_CANT_ACCESS_FILE:
                    case ERROR_INVALID_PARAMETER:
                        if (!STAT(_path.wide, &st)) {
                            result = 1;
                        }
                    }
                }
            }
        }
    }
    Py_END_ALLOW_THREADS

    path_cleanup(&_path);
    if (result) {
        Py_RETURN_TRUE;
    }
    Py_RETURN_FALSE;
}

Co-authored-by: Eryk Sun <[email protected]>
@nineteendo
Copy link
Contributor Author

Are you happy with the new implementation, or do you have more ideas?

@nineteendo
Copy link
Contributor Author

The performance for existent files has already greatly improved. Nice job!

@eryksun
Copy link
Contributor

eryksun commented Apr 15, 2024

That's all I have for now.

@nineteendo nineteendo marked this pull request as ready for review April 15, 2024 06:36
@serhiy-storchaka serhiy-storchaka self-requested a review April 15, 2024 10:49
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance looks correct, although I am not happy that we added so much complicated code for pure optimization of functions which should not be used in performance critical code (os.scandir() should be used instead).

@eryksun
Copy link
Contributor

eryksun commented Apr 15, 2024

At first glance looks correct, although I am not happy that we added so much complicated code for pure optimization of functions which should not be used in performance critical code (os.scandir() should be used instead).

The performance of os.stat() and os.lstat() will improve on Windows once GetFileInformationByName() is generally supported, starting with Windows 11 24H2 later this year. Eventually I expect the builtin _path_* tests to be removed when Python stops supporting Windows 10.

A disappointment with these builtin functions is that they can't be leveraged in pathlib.Path because they're designed to simply return True or False without raising OSError exceptions. The problem is that the pathlib.Path methods exists(), is_dir(), is_file(), and is_symlink() only ignore OSError exceptions for a small set of errno values and Windows error codes.

Regarding scandir(), in principle, the C implementations of _path_isdir() and _path_isfile() could be used by the DirEntry methods is_dir() and is_file(). These methods have to call STAT() if the entry is a reparse point, except if follow_symlinks is false and it's a name-surrogate reparse point1. That's more work than necessary. However, the DirEntry tests only handle FileNotFoundError exceptions, so there's a mismatch in error handling. Fortunately there usually aren't so many reparse points that the accumulated cost of STAT() calls would matter much.

Footnotes

  1. That's if DirEntry was updated to behave like os.stat() and os.lstat(), for which follow_symlinks is interpreted liberally to include junctions and other name-surrogate file types, not just literal symlinks, while other types of reparse points are always traversed because they're not meant to be handled like links. For example, DirEntry.is_dir() incorrectly returns true for a junction that targets a non-existent path.

@barneygale
Copy link
Contributor

A disappointment with these builtin functions is that they can't be leveraged in pathlib.Path because they're designed to simply return True or False without raising OSError exceptions. The problem is that the pathlib.Path methods exists(), is_dir(), is_file(), and is_symlink() only ignore OSError exceptions for a small set of errno values and Windows error codes.

We can at least use it from Path.glob(), which suppresses all OSErrors. I have a draft PR for that here: #117858

I wouldn't mind if we made the is_ methods suppress all OSErrors rather than a fairly arbitrary subset as we do now. We don't document what's suppressed, and it's changed before.

@nineteendo
Copy link
Contributor Author

cc @zooba

@zooba
Copy link
Member

zooba commented May 1, 2024

So I'm hesitant to take this for three reasons (and these do apply to previous enhancements as well, but didn't exist at that time):

  • stat is going to get fast enough on newer Windows for this optimisation to be redundant
  • it's a lot of code and not easily maintainable
  • I don't know any scenarios where you have a lot of paths and you need to check whether they exist

isdir and islink are useful to have, because you may glob or listdir and need to filter its members. But you don't need to exists in that case - everything in the directory has to be assumed to exist (and handle the rare race condition when you try to use it).

If someone can show a scenario where you would have a significant (hundreds+) list of paths, need to check whether they exist, but couldn't use one of isfile, isdir or islink, then I could be persuaded on the third point. If there's some reason why this scenario exists for people who can upgrade Python but not update Windows, then I might be convinced on the first point (or alternatively, if lexists doesn't actually get faster with the new stat APIs, which is a possibility).

I don't think we can really reduce the amount of code. If it happened to be shorter and easier to follow then I'd be less concerned about long-term maintenance, but I'm pretty sure it's as good as it gets (without adding indirection and hurting the performance again - same tradeoff we made with the earlier _path_* functions).

@barneygale
Copy link
Contributor

If someone can show a scenario where you would have a significant (hundreds+) list of paths, need to check whether they exist, but couldn't use one of isfile, isdir or islink, then I could be persuaded on the third point.

On this specifically, one example would be globbing for */__init__.py. The quickest way to implement that is os.scandir() for an initial set of paths, join __init__.py onto each of them, and then call lstat() to filter out nonexistent paths. This is the approach taken in the pathlib globbing implementation:

cpython/Lib/glob.py

Lines 508 to 520 in a7711a2

def select_exists(self, path, exists=False):
"""Yields the given path, if it exists.
"""
if exists:
# Optimization: this path is already known to exist, e.g. because
# it was returned from os.scandir(), so we skip calling lstat().
yield path
else:
try:
self.lstat(path)
yield path
except OSError:
pass

Note that glob results can include dangling symlinks, hence lstat() rather than stat().

@eryksun
Copy link
Contributor

eryksun commented May 2, 2024

* `stat` is going to get fast enough on newer Windows for this optimisation to be redundant

GetFileInformationByName() won't be generally available until Windows 11 24H2 later this year, right? What about older Windows 10/11 systems?

* it's a lot of code and not easily maintainable

The implementation was consolidated with _path_exists() to lessen duplicate code. It was also reordered to avoid the need the need for the close_file variable. But I agree that all of these _path_is* and _path_[l]exists helpers are a lot of code to maintain, taken together. It could be refactored into smaller inline helper functions that can be reused, which would also make the code more readable. nineteendo seems to be pretty good at and enthusiastic about optimizing code.

@nineteendo
Copy link
Contributor Author

But I agree that all of these _path_is* and _path_[l]exists helpers are a lot of code to maintain, taken together. It could be refactored into smaller inline helper functions that can be reused, which would also make the code more readable.

Doesn't seem too difficult for _path_isdir & _path_isfile:

Screenshot 2024-05-02 at 09 14 05

@nineteendo
Copy link
Contributor Author

nineteendo commented May 2, 2024

Could we simplify this? It already seems to know if it's a directory of file:

cpython/Modules/posixmodule.c

Lines 5123 to 5129 in a6b610a

if (!(statInfo.FileAttributes & FILE_ATTRIBUTE_REPARSE_POINT)) {
slow_path = FALSE;
result = statInfo.FileAttributes & FILE_ATTRIBUTE_DIRECTORY;
} else if (!(statInfo.FileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
slow_path = FALSE;
result = 0;
}

cpython/Modules/posixmodule.c

Lines 5220 to 5226 in a6b610a

if (!(statInfo.FileAttributes & FILE_ATTRIBUTE_REPARSE_POINT)) {
slow_path = FALSE;
result = !(statInfo.FileAttributes & FILE_ATTRIBUTE_DIRECTORY);
} else if (statInfo.FileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
slow_path = FALSE;
result = 0;
}

@eryksun
Copy link
Contributor

eryksun commented May 2, 2024

Could we simplify this? It already seems to know if it's a directory of file:

Yes, the fast path can be simplified. If it's not a reparse point, isdir() and isfile() can always use the by-name stat information. That's implemented. isdir() can return False for a file reparse point. That's implemented. Otherwise it has to ensure that the reparsed target exists. This could be improved. Falling back to the full slow path does more work than necessary since it only needs to check existence via CreateFileW(), not the GetFileInformationByHandleEx() query. Similarly isfile() has to verify existence for a file reparse point.

@nineteendo nineteendo marked this pull request as draft May 8, 2024 11:43
@nineteendo nineteendo closed this May 9, 2024
@nineteendo
Copy link
Contributor Author

Tracking further in #118755.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants