SURFRAD site & date-range download #1155
Maybe I should've posted this in the group first? Is there any appetite for this? Better as a script or as a module? Logging okay or less?
I think the existing function is OK as is because SURFRAD publishes daily files. It's not the intent to use that function to read a year of data. I have downloaded years of SURFRAD data using wget; it's not fast, but it's a single command-line statement. Reading into memory isn't too bad using the pvlib function, but I can see the utility of having both steps in Python. What about adding a script to the example gallery as a first step? I'm cautious about adding a …
Following the patterns in some other projects, I'm leery of adding an example that uses a thread pool, since any non-trivial I/O in the docs seems to eventually cause problems.
Thanks all!
I totally agree! After iterating a bit I decided the existing parsing function is fine, but I just wanted a faster way to download the raw SURFRAD data.
This might work. It's not the way I started, but it could be more convenient for folks who want a date range, especially within a single year. There's a limit to how many FTP connections NOAA will accept; it seems to be exactly 5. Also, an existing FTP connection can download many files serially, quite quickly. And the FTP connection is like a file system: I think I can use a full path, but I've been changing directories. So in theory we could open up 5 connections, break up the date range into 5 chunks, and read them until they're done (see the sketch below). That makes a lot of sense, probably more straightforward than my approach. Thanks!
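A minimal sketch of that idea, not pvlib code: the host name, the Bondville_IL directory layout, and the bonYYJJJ.dat file-name pattern are assumptions based on the public SURFRAD FTP server, and error handling and retries are omitted.

```python
import ftplib
import math
import threading

import pandas as pd

HOST = 'aftp.cmdl.noaa.gov'  # assumed SURFRAD FTP host
MAX_CONNECTIONS = 5          # NOAA appears to reject a 6th connection


def download_chunk(station_dir, dates):
    # One long-lived connection, reused for every file in this chunk.
    ftp = ftplib.FTP(HOST)
    ftp.login()  # anonymous
    for date in dates:
        # Assumed naming: 3-letter station code + 2-digit year + day of year.
        fname = f'{station_dir[:3].lower()}{date:%y%j}.dat'
        path = f'/data/radiation/surfrad/{station_dir}/{date:%Y}/{fname}'
        with open(fname, 'wb') as f:
            ftp.retrbinary(f'RETR {path}', f.write)
    ftp.quit()


dates = pd.date_range('2020-01-01', '2020-12-31', freq='D')
# Break the date range into 5 contiguous chunks, one per connection.
size = math.ceil(len(dates) / MAX_CONNECTIONS)
chunks = [dates[i:i + size] for i in range(0, len(dates), size)]
threads = [threading.Thread(target=download_chunk, args=('Bondville_IL', chunk))
           for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```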
@mikofski Pull request #1254 adds a retrieval function for the monthly data files on the BSRN server. As SURFRAD is part of BSRN, this should offer a much quicker way of getting SURFRAD data, and perhaps this issue can be closed? It's worth mentioning that the SURFRAD files include some additional data that the BSRN files do not, such as wind speed and direction and a corresponding flag column for each variable.
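For comparison, a sketch of how that retrieval might look, assuming the get_bsrn interface added in #1254 (the exact signature may differ by pvlib version, and BSRN credentials must be requested separately):

```python
import pandas as pd
import pvlib

# Bondville is BSRN station 'BON'; username/password are placeholders
# for credentials obtained from the BSRN.
data, metadata = pvlib.iotools.get_bsrn(
    station='BON',
    start=pd.Timestamp('2020-01-01'),
    end=pd.Timestamp('2020-12-31'),
    username='<bsrn-username>',
    password='<bsrn-password>',
)
```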
Let me mull it over. I don't know how much overlap there is, but my gut tells me folks will still want to use the raw SURFRAD iotools.
It may be worth benchmarking the retrieval speeds for each data source before trying to improve the raw SURFRAD fetch. But removing the existing SURFRAD fetch/read functions is not on the table.
Wow, that's 3 times faster, but still over a day for 25 years of data. @AdamRJensen, can you ask your contact how many HTTPS connections are allowed from the same host? I still think threading this request is the way to go, but maybe we leave that to the user? Any complaints if I close this issue now? I don't think I'll work on it, and the funny thing is you only need to download the SURFRAD data once. Maybe this is better as a gallery example?
Is your feature request related to a problem? Please describe.
The current SURFRAD iotools only reads in a single day's .dat file from either a URL or a filesystem. Unfortunately, I can't quickly read an entire year or any arbitrarily large date range. I can use pvlib.iotools.read_surfrad in a loop (see the sketch below), but it takes a long time to serially read in an entire year. Maybe it would be faster if I already had the files downloaded. It takes about 1 second to read a single 111 kB file, so for 10,000 files that would be about 3 hours, which is too long if I have to read 7 sites. That's 14.4 s / 16 files = 0.9 s per file. I tried to use threading, but then I get connection errors; I think there's a limit of 5 connections to the NOAA FTP server from one computer. That should bring it down to about 30 minutes, hmm, maybe I didn't try hard enough? Anyway, I went a different way.
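For illustration, a sketch of that serial loop; the URL pattern (host, directory, file name) is an assumption based on the public SURFRAD FTP layout:

```python
import pandas as pd
import pvlib

# Serially read one month of Bondville data, one daily file per request.
frames = []
for date in pd.date_range('2020-01-01', '2020-01-31', freq='D'):
    url = ('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/'
           f'Bondville_IL/{date:%Y}/bon{date:%y%j}.dat')
    data, metadata = pvlib.iotools.read_surfrad(url)
    frames.append(data)
month = pd.concat(frames)  # at ~1 s per file, a full year takes ~6 minutes
```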
Describe the solution you'd like
The current read_surfrad uses Python's urllib.request.urlopen for each connection. I have found that opening a long-lived FTP connection using Python's ftplib allows downloading a lot more files by reusing the same connection; a sketch follows below. However, this download is still serial, so I have also found that using Python's threading allows me to open up to 5 simultaneous connections, but any more and I get a 421 FTP error, too many connections.
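A sketch of the reused-connection approach, with the host and directory layout assumed (a threaded variant over 5 connections appears in the comment earlier on this page):

```python
import ftplib

# Log in once and reuse the same FTP session for every daily file,
# instead of opening one connection per file with urlopen().
ftp = ftplib.FTP('aftp.cmdl.noaa.gov')
ftp.login()  # anonymous
ftp.cwd('/data/radiation/surfrad/Bondville_IL/2020')
for doy in range(1, 367):  # 2020 is a leap year
    fname = f'bon20{doy:03d}.dat'
    try:
        with open(fname, 'wb') as f:
            ftp.retrbinary(f'RETR {fname}', f.write)
    except ftplib.error_perm:
        pass  # skip days with no file on the server
ftp.quit()
```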
Describe alternatives you've considered
I was able to open the FTP site directly in Windows, but it was also a serial connection, so about 10,000 files (about 1 GB) would have taken 4 hours. By contrast, using ftplib and threading I can download all of the data from a single site in about 25 minutes.
Additional context
#590
#595
gist of my working script: https://gist.github.com/mikofski/30455056b88a5d161598856cc4eedb2c