
Option for closing files with scipy backend #468


Closed

Conversation

rabernat
Contributor

This addresses #463, in which open_mfdataset failed when trying to open a list of files longer than my system's open-file limit (ulimit). I tried to find a solution in which the underlying netcdf file objects are kept closed by default and only reopened when needed.

I ended up subclassing scipy.io.netcdf_file and overriding the variables attribute with a property that first checks whether the file is open or closed and opens it if needed. That was the easy part. The hard part was figuring out when to close the files again: several different parts of the code (e.g. each individual variable, and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read, and added calls to close() in several places. It is relatively easy to close the files at the end of initialization, but it was much harder to make sure the whole set of files is never open at the same time. I also had to disable mmap when this option is active.
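The core of the idea is roughly the sketch below. The names here (LazyNetCDFFile, _ensure_open, _lazy_*) are made up for illustration and are not the actual code in the branch; this only shows the reopening-property pattern, not the close() bookkeeping described above.

import scipy.io

class LazyNetCDFFile(scipy.io.netcdf_file):
    # Sketch only: remember the filename, disable mmap, and reopen the
    # underlying file object whenever ``variables`` is touched after a
    # close(). Assumes ``filename`` is a path string.
    def __init__(self, filename, mode='r', **kwargs):
        self._lazy_filename = filename
        kwargs['mmap'] = False  # mmap cannot survive closing the file
        super(LazyNetCDFFile, self).__init__(filename, mode=mode, **kwargs)

    def _ensure_open(self):
        # ``fp`` is the file object netcdf_file reads from
        if self.fp.closed:
            self.fp = open(self._lazy_filename, 'rb')

    @property
    def variables(self):
        self._ensure_open()
        return self._lazy_variables

    @variables.setter
    def variables(self, value):
        # netcdf_file assigns self.variables during __init__,
        # so the property needs a setter
        self._lazy_variables = value

The real branch additionally has to decide where to call close(), which is the messy part described above.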

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option can be accessed via the close_files keyword, which I added to the api.
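The count_open_files() helper used below isn't defined in these snippets; a minimal version, assuming psutil is available, would be something like:

import psutil

def count_open_files():
    # Hypothetical helper (not part of xray): report how many file
    # descriptors this process currently holds.
    print('%d open files' % len(psutil.Process().open_files()))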

Timing for loading and doing a calculation with close_files=True:

count_open_files()
%time mfds =  xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()

output:

3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files

Timing for loading and doing a calculation with close_files=False (default, should revert to old behavior):

count_open_files()
%time mfds =  xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
output:

3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...

@shoyer
Member

shoyer commented Jul 15, 2015

Sorry for letting this linger... I'm going to run some tests on this later and see if I notice the same performance issues.

@rabernat
Contributor Author

I would honestly not waste too much time on this pull request. It is a messy solution. I just wanted to see if it could actually be done. (It can.)

@shoyer
Member

shoyer commented Jul 15, 2015

I'm mostly curious about the performance overhead, for which I think this solution would be fine.


@rabernat
Contributor Author

rabernat commented Aug 9, 2015

No. I meant to close #522. I don't know why this was closed.

@rabernat
Contributor Author

I don't quite understand what happened here. The PR was deleted when I rebased my master branch. My feature branch still exists: https://github.com/rabernat/xray/tree/close_files

Should I just open a new PR?

@jhamman
Member

jhamman commented Aug 10, 2015

Ah, yes. Open a new PR. You must have pushed your rebased master to your fork.

In general, I never edit on my fork's master branch. My typical workflow is to create separate feature branches off the xray/xray:master branch:

git checkout master
git pull upstream master
git checkout -b feature/my_feature
# ...edit...
git add edited_files*
git commit ...
git push origin feature/my_feature
# create pull request from my fork's feature/my_feature to xray:master on github

@rabernat
Contributor Author

Thanks @jhamman. That's what I thought I did, but I must have screwed up and created my PR from my master branch... Anyway, this is now #524.
