Performance improvement for `Reader.__shapeIndex` #52

ARF1 · 2016-04-07T17:56:12Z

Improves the performance of Reader.__shapeIndex.

x305 speedup over master with numpy available
x19 speedup over master without numpy

(Benchmarked on CPython 3.4.4 on Windows 8.1)

x2.0 speedup over master

x1.7 speedup over previous commit

x1.3 speedup over previous commit

x1.7 speedup over previous commit

…elements x1.5 speedup over previous commit Due to unexplained reasons, this seems to significantly speed up the apparently untouched read().

micahcochran · 2016-04-07T18:32:54Z

by a factor of x10.2.

🐇 💨 🏃 💨 🏇 💨 If only all performance improvements were this easy to achieve.

billgale · 2016-04-07T18:55:55Z

Tell me about it.

Could not resist.

From: Micah Cochran [mailto:[email protected]]
Sent: Thursday, April 07, 2016 1:33 PM
To: GeospatialPython/pyshp [email protected]
Subject: Re: [GeospatialPython/pyshp] Performance improvement for Reader.__shapeIndex (#52)

by a factor of x10.2.

[:rabbit2:][:dash:][:runner:][:dash:][:horse_racing:][:dash:]If only all performance improvements were this easy to achieve.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHubhttps://github.com//pull/52#issuecomment-207040246

ARF1 · 2016-04-07T19:38:01Z

This PR was intended to judge how performance tweaks would be received by the maintainer(s?) of this project. If viewed upon favourably, I am planning to apply the same principles to Reader.__shape where the real gains could probably be made.

Does this project have a maintainer that could give a tentative indication?

Also: does this project have a test suite? If I blunder about in the code, I would feel better if I knew that I was not breaking thinks inadvertently. I see a test and shapefiles folder but have no idea how to use them for testing.

micahcochran · 2016-04-07T19:58:20Z

Does this project have a maintainer that could give a tentative indication?

I think @GeospatialPython is the only maintainer.

does this project have a test suite?

Yes, there are some good doctests in the README.txt, which is great. I had to do a little bit of work to get it to work under Travis PR #43 (mainly ran dos2unix). The README.txt really needs to be installed with shapefile in setup.py for the unittests to work standalone. If you download the repository and then run >>> shapefile.test() from python console or '$ python shapefile.py' from the prompt. Make sure that the README.txt is in your local directory. It should work on a Windows box (or run $ dos2unix README.txt on a Linux box).

GeospatialPython · 2016-04-07T20:16:06Z

Absolutely! I'm open to all improvements and even other maintainers. In fact I would love to have more maintainers. My only guiding principles for this library are to keep it as a single file and to only use Python standard libraries. No folder module stricture or multiple files, no compiled C code outside the standard library, no libraries outside a standard Python install.

ARF1 · 2016-04-08T04:56:54Z

@GeospatialPython Great! Thanks for the feedback.

re additional maintainers: I am probably not the best person for the job as I am only solving a one-off spatial problem. Normally I have absolutely nothing to do with shapefiles, etc.

Edit: never mind re python 2.6 compatibility. We can have memoryview and keep python 2.6 compatibility after all.

x2.0 speedup over previous commit

karimbahgat · 2016-04-08T07:19:58Z

@ARF1 how exactly have you used memoryview for those types of gains? Seems to me the only way it would help is when calling "records()" or "shapes()" to load all items at once and return them in a list. In this case the items would be loaded as usual, but then instead of returning the items in a list you return them in a memoryview as a wrapper around the list. The way i understand the benefits of memoryviews is they allow faster slicing and indexing of some preexisting buffer, so if i am correct, any speedup would be dependent on what types of action one does to the returned list, so not necessarily faster loading or iteration? Or maybe there was another way you were thinking?

Re py26, it should still be possible to implement memoryview without losing 2.6, returning data as original list if 2.6, or wrapped in a memoryview for later versions.

On 08 Apr 2016, at 06:56, ARF1 [email protected] wrote:

@GeospatialPython Great! Thanks for the feedback.

re additional maintainers: I am probably not the best person for the job as I am only solving a one-off spatial problem. Normally I have absolutely nothing to do with shapefiles, etc.

re principles: what about python 2.6 compatibility? Python 2.7 introduced memoryview which is great for high performance byte-juggling.

With memoryview I am now at a speed up of x20.8 vs master.

@micahcochran @billgale How attached are the other users of pyshp to python 2.6 compatibility?

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub

x378 speedup over master with numpy available x22 speedup over master without numpy

ARF1 · 2016-04-09T16:02:36Z

@karimbahgat The performance gains stem not primarily from the use of memoryview but from

avoiding the (implicit) creation of python objects where possible
streaming data from disk en bloc to memory rather than reading it piecewise

memoryview somehow helps with the second point. Not sure why that is the case. For that you really would need to look at the python internals.

Edit:

The memoryview speed-up and the two identified points above were artefacts of my profiler. Use of memoryview has not performance implications what-so-ever. All other performance improvements hold up.

micahcochran · 2016-08-09T18:22:47Z

@ARF1 Can you provide the code that you used in order to benchmark this?

ARF1 · 2016-08-10T12:03:55Z

@micahcochran I am sorry but I moved on from this project. As I explained above, I was only solving a one-off spatial problem.

Also I found that ShapeIndex is much easier to optimize than the efficient pipe-lining to Shapely. For the latter (where the true perfomance gains lie) non-trivial changes to the codebase are required.

But you can rig up the test-code yourself easily:

Download a OSM land polygon shapefile.
Open it and request some arbitrary shape by index.

This should first build the ShapeIndex before returning the data. Voila...

karimbahgat · 2016-08-16T07:46:21Z

🐇 💨 🏃 💨 🏇 💨 If only all performance improvements were this easy to achieve.

It seems they just might be. Based on the principle of batch reading/unpacking as in this PR, with only a few lines of trivial changes it was possible to get 5-8x performance boost for shapes, and 15-20x for listrecords. See #62 . And in case @ARF1's PR stalls due to the numpy inclusion (albeit optional), I also added an even more minimal version of the shapindex speedup.

karimbahgat · 2016-09-11T21:37:40Z

Absolutely! I'm open to all improvements and even other maintainers. In fact I would love to have more maintainers.

Actually, if you're still open for additional maintainers @GeospatialPython I wouldn't mind helping out? I can't commit to anything specifically, but I'd likely be able to sort out some of the issues and PRs from time to time. I rely on pyshp on an almost daily basis, and like you I want pyshp to stay simple and would try to keep any changes minimal.

karimbahgat · 2016-09-24T21:02:19Z

Implemented without numpy in #62.

ARF added 5 commits April 7, 2016 15:03

__shapeIndex optimization: removed superfluous tell()

6e2ff4e

x2.0 speedup over master

__shapeIndex optimization: remove seek() call

6e26248

x1.7 speedup over previous commit

__shapeIndex optimization: use list comprehension

8f671db

x1.3 speedup over previous commit

__shapeIndex optimization: read into array and discard unneeded elements

efafff8

x1.7 speedup over previous commit

__shapeIndex optimization: explicit format string excluding unneeded …

4c0b9b4

…elements x1.5 speedup over previous commit Due to unexplained reasons, this seems to significantly speed up the apparently untouched read().

__shapeIndex optimization: use memoryview slicing to skip unused field

eec5efe

x2.0 speedup over previous commit

ARF1 force-pushed the perf-opt-Reader__shapeIndex branch 3 times, most recently from eec5efe to 4c0b9b4 Compare April 8, 2016 06:13

__shapeIndex optimization: use numpy when available for array arithmetic

fb25a04

x378 speedup over master with numpy available x22 speedup over master without numpy

ARF1 force-pushed the perf-opt-Reader__shapeIndex branch 3 times, most recently from c42ad69 to fb25a04 Compare April 9, 2016 13:40

karimbahgat mentioned this pull request Aug 16, 2016

Significant speedups with minimal changes #62

Merged

karimbahgat closed this Sep 24, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvement for `Reader.__shapeIndex` #52

Performance improvement for `Reader.__shapeIndex` #52

ARF1 commented Apr 7, 2016

micahcochran commented Apr 7, 2016

billgale commented Apr 7, 2016

ARF1 commented Apr 7, 2016

micahcochran commented Apr 7, 2016

GeospatialPython commented Apr 7, 2016

ARF1 commented Apr 8, 2016

karimbahgat commented Apr 8, 2016

ARF1 commented Apr 9, 2016 •

edited

Loading

micahcochran commented Aug 9, 2016

ARF1 commented Aug 10, 2016

karimbahgat commented Aug 16, 2016

karimbahgat commented Sep 11, 2016

karimbahgat commented Sep 24, 2016

Performance improvement for Reader.__shapeIndex #52

Performance improvement for Reader.__shapeIndex #52

Conversation

ARF1 commented Apr 7, 2016

micahcochran commented Apr 7, 2016

billgale commented Apr 7, 2016

ARF1 commented Apr 7, 2016

micahcochran commented Apr 7, 2016

GeospatialPython commented Apr 7, 2016

ARF1 commented Apr 8, 2016

karimbahgat commented Apr 8, 2016

ARF1 commented Apr 9, 2016 • edited Loading

micahcochran commented Aug 9, 2016

ARF1 commented Aug 10, 2016

karimbahgat commented Aug 16, 2016

karimbahgat commented Sep 11, 2016

karimbahgat commented Sep 24, 2016

Performance improvement for `Reader.__shapeIndex` #52

Performance improvement for `Reader.__shapeIndex` #52

ARF1 commented Apr 9, 2016 •

edited

Loading