Skip to content

Performance improvement for Reader.__shapeIndex #52

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

ARF1
Copy link

@ARF1 ARF1 commented Apr 7, 2016

Improves the performance of Reader.__shapeIndex.

x305 speedup over master with numpy available
x19 speedup over master without numpy

(Benchmarked on CPython 3.4.4 on Windows 8.1)

ARF added 5 commits April 7, 2016 15:03
x1.7 speedup over previous commit
x1.3 speedup over previous commit
…elements

x1.5 speedup over previous commit

Due to unexplained reasons, this seems to significantly speed up the apparently
untouched read().
@micahcochran
Copy link
Contributor

by a factor of x10.2.

🐇 💨 🏃 💨 🏇 💨 If only all performance improvements were this easy to achieve.

@billgale
Copy link

billgale commented Apr 7, 2016

Tell me about it.

Could not resist.

From: Micah Cochran [mailto:[email protected]]
Sent: Thursday, April 07, 2016 1:33 PM
To: GeospatialPython/pyshp [email protected]
Subject: Re: [GeospatialPython/pyshp] Performance improvement for Reader.__shapeIndex (#52)

by a factor of x10.2.

[:rabbit2:][:dash:][:runner:][:dash:][:horse_racing:][:dash:]If only all performance improvements were this easy to achieve.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHubhttps://github.com//pull/52#issuecomment-207040246

@ARF1
Copy link
Author

ARF1 commented Apr 7, 2016

This PR was intended to judge how performance tweaks would be received by the maintainer(s?) of this project. If viewed upon favourably, I am planning to apply the same principles to Reader.__shape where the real gains could probably be made.

Does this project have a maintainer that could give a tentative indication?

Also: does this project have a test suite? If I blunder about in the code, I would feel better if I knew that I was not breaking thinks inadvertently. I see a test and shapefiles folder but have no idea how to use them for testing.

@micahcochran
Copy link
Contributor

Does this project have a maintainer that could give a tentative indication?

I think @GeospatialPython is the only maintainer.

does this project have a test suite?

Yes, there are some good doctests in the README.txt, which is great. I had to do a little bit of work to get it to work under Travis PR #43 (mainly ran dos2unix). The README.txt really needs to be installed with shapefile in setup.py for the unittests to work standalone. If you download the repository and then run >>> shapefile.test() from python console or '$ python shapefile.py' from the prompt. Make sure that the README.txt is in your local directory. It should work on a Windows box (or run $ dos2unix README.txt on a Linux box).

@GeospatialPython
Copy link
Owner

Absolutely! I'm open to all improvements and even other maintainers. In fact I would love to have more maintainers. My only guiding principles for this library are to keep it as a single file and to only use Python standard libraries. No folder module stricture or multiple files, no compiled C code outside the standard library, no libraries outside a standard Python install.

@ARF1
Copy link
Author

ARF1 commented Apr 8, 2016

@GeospatialPython Great! Thanks for the feedback.

re additional maintainers: I am probably not the best person for the job as I am only solving a one-off spatial problem. Normally I have absolutely nothing to do with shapefiles, etc.


Edit: never mind re python 2.6 compatibility. We can have memoryview and keep python 2.6 compatibility after all.

@ARF1 ARF1 force-pushed the perf-opt-Reader__shapeIndex branch 3 times, most recently from eec5efe to 4c0b9b4 Compare April 8, 2016 06:13
@karimbahgat
Copy link
Collaborator

@ARF1 how exactly have you used memoryview for those types of gains? Seems to me the only way it would help is when calling "records()" or "shapes()" to load all items at once and return them in a list. In this case the items would be loaded as usual, but then instead of returning the items in a list you return them in a memoryview as a wrapper around the list. The way i understand the benefits of memoryviews is they allow faster slicing and indexing of some preexisting buffer, so if i am correct, any speedup would be dependent on what types of action one does to the returned list, so not necessarily faster loading or iteration? Or maybe there was another way you were thinking?

Re py26, it should still be possible to implement memoryview without losing 2.6, returning data as original list if 2.6, or wrapped in a memoryview for later versions.

On 08 Apr 2016, at 06:56, ARF1 [email protected] wrote:

@GeospatialPython Great! Thanks for the feedback.

re additional maintainers: I am probably not the best person for the job as I am only solving a one-off spatial problem. Normally I have absolutely nothing to do with shapefiles, etc.

re principles: what about python 2.6 compatibility? Python 2.7 introduced memoryview which is great for high performance byte-juggling.

With memoryview I am now at a speed up of x20.8 vs master.

@micahcochran @billgale How attached are the other users of pyshp to python 2.6 compatibility?


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub

x378 speedup over master with numpy available
x22 speedup over master without numpy
@ARF1 ARF1 force-pushed the perf-opt-Reader__shapeIndex branch 3 times, most recently from c42ad69 to fb25a04 Compare April 9, 2016 13:40
@ARF1
Copy link
Author

ARF1 commented Apr 9, 2016

@karimbahgat The performance gains stem not primarily from the use of memoryview but from

  1. avoiding the (implicit) creation of python objects where possible
  2. streaming data from disk en bloc to memory rather than reading it piecewise

memoryview somehow helps with the second point. Not sure why that is the case. For that you really would need to look at the python internals.


Edit:

The memoryview speed-up and the two identified points above were artefacts of my profiler. Use of memoryview has not performance implications what-so-ever. All other performance improvements hold up.

@micahcochran
Copy link
Contributor

@ARF1 Can you provide the code that you used in order to benchmark this?

@ARF1
Copy link
Author

ARF1 commented Aug 10, 2016

@micahcochran I am sorry but I moved on from this project. As I explained above, I was only solving a one-off spatial problem.

Also I found that ShapeIndex is much easier to optimize than the efficient pipe-lining to Shapely. For the latter (where the true perfomance gains lie) non-trivial changes to the codebase are required.

But you can rig up the test-code yourself easily:

  1. Download a OSM land polygon shapefile.
  2. Open it and request some arbitrary shape by index.

This should first build the ShapeIndex before returning the data. Voila...

@karimbahgat
Copy link
Collaborator

🐇 💨 🏃 💨 🏇 💨 If only all performance improvements were this easy to achieve.

It seems they just might be. Based on the principle of batch reading/unpacking as in this PR, with only a few lines of trivial changes it was possible to get 5-8x performance boost for shapes, and 15-20x for listrecords. See #62 . And in case @ARF1's PR stalls due to the numpy inclusion (albeit optional), I also added an even more minimal version of the shapindex speedup.

@karimbahgat
Copy link
Collaborator

Absolutely! I'm open to all improvements and even other maintainers. In fact I would love to have more maintainers.

Actually, if you're still open for additional maintainers @GeospatialPython I wouldn't mind helping out? I can't commit to anything specifically, but I'd likely be able to sort out some of the issues and PRs from time to time. I rely on pyshp on an almost daily basis, and like you I want pyshp to stay simple and would try to keep any changes minimal.

@karimbahgat
Copy link
Collaborator

Implemented without numpy in #62.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants