Skip to content

bpo-29659: Expose copyfileobj() length arg for public use #328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

goodboy
Copy link

@goodboy goodboy commented Feb 27, 2017

Corresponds to issue bpo-29659

When copying large files the copy time can be sufficiently
shortened by increasing the memory buffer used in the copyfileobj()
routine. Expose the length argument from copyfileobj() upwards
for use throughout the module.

https://bugs.python.org/issue29659

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow these steps to rectify the issue:

  1. Sign the PSF contributor agreement. The "bugs.python.org username" requested by the form is the "Login name" field in "Your Details" at b.p.o
  2. Wait at least one US business day and then check the "Contributor form received entry under "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
  3. Reply here saying you have completed the above steps

Thanks again to your contribution and we look forward to looking at it!

@Mariatta Mariatta changed the title Expose copyfileobj() length arg for public use bpo-29659: Expose copyfileobj() length arg for public use Feb 27, 2017
@goodboy
Copy link
Author

goodboy commented Feb 27, 2017 via email

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must update Doc/library/shutil.rst: document the new parameter, but also add a ".. versionchanged:: 3.7" section to document the change.

Lib/shutil.py Outdated

An in-memory buffer size can be set with `length`.
"""
length = length or 16*1024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest:

if not length:
length = 16 * 1024

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Lib/shutil.py Outdated
def copyfileobj(fsrc, fdst, length=None):
"""Copy data from file-like object `fsrc` to file-like object `fdst`.

An in-memory buffer size can be set with `length`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should document the default size: 16 kB.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do 👍

Lib/shutil.py Outdated
"""Copy data from src to dst.

If follow_symlinks is not set and src is a symbolic link, a new
symlink will be created instead of copying the file it points to.

An in memory buffer size can be set with `length`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same: document default size.

@vstinner vstinner added the type-feature A feature request or enhancement label Mar 2, 2017
@goodboy goodboy force-pushed the expose_shutil_membuf-29659 branch from 4feede0 to 39a9b83 Compare March 3, 2017 05:25
@goodboy
Copy link
Author

goodboy commented Mar 3, 2017

@Haypo I believe I've addressed everything!
If there's anything I'm missing please let me know :)

@@ -74,6 +74,9 @@ Directory and files operations
Raise :exc:`SameFileError` instead of :exc:`Error`. Since the former is
a subclass of the latter, this change is backward compatible.

.. versionchanged:: 3.7
Expose the `length` arg from `copyfileobj`; default is 16 KB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn'd say "expose ..." but just "Added length parameter", and versionadded is not the right place to document "default is ...", put it in the documentation body.

You didn't update copyfile() signature. Same remark for the 2 other functions.

Moreover, you should also document the length in the function document, not only in versionadded. The "documentation" can just ask to refer to copyfileobj().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot yeah not sure how I missed the signatures...

Lib/shutil.py Outdated
def copyfileobj(fsrc, fdst, length=None):
"""Copy data from file-like object `fsrc` to file-like object `fdst`.

An in-memory buffer size can be set with `length`; the default is 16 KB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please mention the unit, "size in bytes"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Haypo to clarify, do you prefer:
'An in-memory buffer size can be set with length; the default size in bytes is 16*1024',
or
"An in-memory buffer size in bytes can be set with length; the default is 16 KB.
?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"An in-memory buffer size in bytes can be set withlength; the default is 16 KB."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encourage you to use KiB, which other parts of the documentation already use. If you write 16 KB it could easily be interpreted as 16,000 B (lower-case k means 1000, as in km for kilometre).

@goodboy goodboy force-pushed the expose_shutil_membuf-29659 branch from 39a9b83 to 158a71b Compare March 3, 2017 19:24
@goodboy
Copy link
Author

goodboy commented Mar 3, 2017

@Haypo hopefully 3rd times the charm ;)

If you'd like me to squash the history to one commit let me know!

Copy link
Member

@terryjreedy terryjreedy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: leave API of copyfileobj should alone as 'length=16*1024' to document default. For other functions, say something like 'If length not passed, copyfileobj uses its default.' We might change it someday.

Required: new or revised tests that use the expanded api (and fail without patch).

@goodboy
Copy link
Author

goodboy commented Mar 4, 2017

@terryjreedy the only problem with that is that we'll need to have the if length checks in all consuming functions as opposed to only once in copyfileobj no? Maybe you have a better approach?

Yes I totally agree about the test. I guess it goes here?

@vstinner
Copy link
Member

vstinner commented Mar 4, 2017 via email

@@ -74,6 +80,9 @@ Directory and files operations
Raise :exc:`SameFileError` instead of :exc:`Error`. Since the former is
a subclass of the latter, this change is backward compatible.

.. versionchanged:: 3.7
Added `length` parameter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation should use three spaces here:

.. versionchanged:: 3.7
   Added the *length* parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`length` -> *length*

(and please finish the sentence with a full stop.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

When copying large files the copy time can be sufficiently
shortened by increasing the memory buffer used in the `copyfileobj()`
routine.  Expose the `length` argument from `copyfileobj()` upwards
for use throughout the module.
@goodboy goodboy force-pushed the expose_shutil_membuf-29659 branch from 158a71b to bfeaee8 Compare September 15, 2017 03:40

Note that if the current file position of the *fsrc* object is not 0,
only the contents from the current file position to the end of the file
will be copied.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO “Note that” is redundant. In fact the whole first line could be dropped, and “Only the contents from the current file position . . .” could be put back in the first paragraph.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please address pending comments before I can approve the change ;-)

Lib/shutil.py Outdated
"""Copy data from file-like object `fsrc` to file-like object `fdst`.

An in-memory buffer size in bytes can be set with `length`; the default is
16 KB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: KiB

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Lib/shutil.py Outdated
16 KB.
"""
if not length:
length = 16*1024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP 8: add spaces around operators, "16 * 1024"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Lib/shutil.py Outdated
@@ -251,10 +262,13 @@ def copy2(src, dst, *, follow_symlinks=True):
If follow_symlinks is false, symlinks won't be followed. This
resembles GNU's "cp -P src dst".

An in-memory buffer size in bytes can be set with `length`; the default is
16 KB.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this empty line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@vstinner
Copy link
Member

vstinner commented Nov 6, 2017

Note: I created PR #4293 to replace KB with KiB in the Python code base ;-)

@goodboy goodboy force-pushed the expose_shutil_membuf-29659 branch from bfeaee8 to 0043566 Compare November 20, 2017 01:27
@goodboy
Copy link
Author

goodboy commented Nov 20, 2017

I think everything has been addressed except for @terryjreedy's:

Suggestion: leave API of copyfileobj should alone as 'length=16*1024' to document default. For other functions, say something like 'If length not passed, copyfileobj uses its default.' We might change it someday.

I don't know how to keep the default value in copyfileobj but also support it as a default argument in all client functions without length = 16 * 1024 if length is None else length in each case. Personally I think that's a bit uglier. I can also stick the default value in each client function to be length=16 * 1024 and always pass it through transparently if that is preferred; it just means changing the value in multiple places if it is changed in the future. I had originally opted for one change in one place if it was to be modified later.

Whatever you all think is best I'll do 👍

src = os.path.join(src_dir, 'foo')
write_file(src, 'foo' * 10 ** 6)
# buffer with 20 Kib
fn(src, dst_dir, length=20 * 1024)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to use a smaller file: write('x' * 100) and then length=20.

By the way, I suggest to use tempfile.NamedTemporaryFile rather than temporary files:

            with tempfile.NamedTemporaryFile() as src:
                with tempfile.NamedTemporaryFile() as dst:
                    write_file(src.name, b'x' * 100, binary=True)
                    fn(src.name, dst.name, length=20)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe thanks @vstinner! You did the work for me 👍

@goodboy goodboy force-pushed the expose_shutil_membuf-29659 branch from 0043566 to 8daf7a8 Compare November 20, 2017 21:51
@@ -73,8 +73,14 @@ class RegistryError(Exception):
and unpacking registries fails"""


def copyfileobj(fsrc, fdst, length=16*1024):
"""copy data from file-like object fsrc to file-like object fdst"""
def copyfileobj(fsrc, fdst, length=None):
Copy link
Contributor

@AraHaan AraHaan Nov 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def copyfileobj(fsrc, fdst, length=16*1024):

I personally prefer this way in my code to concentrate 2 lines worth of code into one.

    if not length:
        length = 16 * 1024

It looks a little nicer and not to mention saves lines of code that basically do the same thing.
And yes I am also guilty of doing it this way and then people started hating me for it and calling my a bad programmer for it.

@goodboy
Copy link
Author

goodboy commented Nov 21, 2017

Just to clear this up since it seems my original reasoning isn't showing well and the drive-by comments are piling up.

Afaict there are 3 ways to deal with the default value of length:

  • stick length=16*1024 in all three function signatures for copyfileobj, copy, and copy2

    • this means any time the default needs to be changed it will be done in 3 places
    • this is the least amount of lines of code
  • keep it the way it is (i.e. copy and copy2 define length=None in their signatures and one function, copyfileobj, has the current if length is None: check)

    • one more line of code then the previous solution
    • have to change the length in one place in the future
    • length is an explicit default argument in copy and copy2
  • make copy and copy2 take **kwargs and pass them through to the internal call to copyfileobj

    • same lines of code as the first solution
    • doesn't explicitly document length in each of the the copy, copy2 signatures
    • one place to change the default value in the future

If one of these solutions is preferred over the others after acknowledging the trade-offs I'm fine to do it, but just simply saying length=16*1024 is less code doesn't deal with how to handle copy and copy2 and the extra (or less clear) code they may contain.

@goodboy
Copy link
Author

goodboy commented Nov 21, 2017

@vstinner looks like the appveyor build failed due to the with open(path, 'wb' if binary else 'w') as fp: line?

@vstinner
Copy link
Member

@vstinner looks like the appveyor build failed due to the with open(path, 'wb' if binary else 'w') as fp: line?

ERROR: test_copy_w_different_length (test.test_shutil.TestShutil)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\cpython\lib\test\test_shutil.py", line 1417, in test_copy_w_different_length
    write_file(src.name, b'x' * 100, binary=True)
  File "C:\projects\cpython\lib\test\test_shutil.py", line 60, in write_file
    with open(path, 'wb' if binary else 'w') as fp:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpj1vhtx1b'

Sorry, I don't know why the test failed.

@AraHaan
Copy link
Contributor

AraHaan commented Nov 22, 2017

Looks to me like an normal permission error as python seems to not have permissions to write to the temp folder? Maybe somehow have python write to somewhere that does not require administrator permissions? It would be nice if python had an nice function that you provide the path and it checks if you have permissions to write to there to avoid having to try / except permission errors. 🤔 That function would be the perfect thing to fix this test on Windows.

@goodboy
Copy link
Author

goodboy commented Dec 11, 2017

Ok so besides the outstanding:

  • missing news entry
  • failing test

Is there anything else in the patch that needs to be addressed?

@AraHaan
Copy link
Contributor

AraHaan commented Dec 11, 2017

I dont think so.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike length=-1 "feature".

16 KiB.
"""
if not length:
length = 16 * 1024
while 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it's ok to use a loop if the length is negative. I suggest to have a special case for negative value calling read() (no parameter) only once.

The integer *length*, if given, is the buffer size; the default value
in bytes is 16 KiB. A negative *length* value means to copy the data without
looping over the source data in chunks; by default the data is read in
chunks to avoid uncontrolled memory consumption.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A negative length value means to copy the data without looping over the source data in chunks"

I dislike this definition. In practice, negative means "unlimited" buffer size: the whole input file is loaded into memory.

I'm not sure that it's a good practice to try to load files of unknown size into memory.

I suggest to remove this feature which seems more like a side effect than a carefully designed API.

If you want to get fast copy, pass a very large length like 1 GB. But if Python starts to load 1 TB into memory, it's likely to crash the system... At least, to slow down the system, a lot.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@vstinner
Copy link
Member

Fixed by https://bugs.python.org/issue33671

klange pushed a commit to toaruos/cpython that referenced this pull request Sep 15, 2021
jaraco pushed a commit that referenced this pull request Dec 2, 2022
Bumps [gidgethub](https://github.com/brettcannon/gidgethub) from 4.0.0 to 4.1.0.
- [Release notes](https://github.com/brettcannon/gidgethub/releases)
- [Changelog](https://github.com/brettcannon/gidgethub/blob/master/docs/changelog.rst)
- [Commits](https://github.com/brettcannon/gidgethub/commits/v4.1.0)

Signed-off-by: dependabot-preview[bot] <[email protected]>

Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>
Co-authored-by: Mariatta <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting changes type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants