SURFRAD site & date-range download #1155


Open
mikofski opened this issue Jan 30, 2021 · 9 comments

Comments

@mikofski
Member

mikofski commented Jan 30, 2021

Is your feature request related to a problem? Please describe.
The current SURFRAD iotools only reads a single-day .dat file, either from a URL or from the filesystem, e.g.:

# read from url
pvlib.iotools.read_surfrad('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/2021/bon21001.dat')
# read from file
pvlib.iotools.read_surfrad('bon21001.dat')

Unfortunately, I can't quickly read an arbitrarily large date range. I can call pvlib.iotools.read_surfrad in a loop, but serially reading an entire year takes a long time; it would probably be faster if I already had the files downloaded. It takes about 1 second to read a single 111 kB file, so 10,000 files would take about 3 hours, which is too long when I have to read 7 sites.

%%timeit
bon95 = [
    pvlib.iotools.read_surfrad(r'ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/1995/bon95%03d.dat' % (x+1))
    for x in range(16)]  # read in 16 files

14.4 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's 14.4 s / 16 files ≈ 0.9 s per file. I tried threading, but then I got connection errors; there seems to be a limit of 5 simultaneous connections to the NOAA FTP server from a single host. Five connections should bring it down to about 30 minutes, so maybe I just didn't try hard enough. Anyway, I went a different way.

Describe the solution you'd like
The current read_surfrad uses Python's urllib.request.urlopen for each file. I have found that opening a long-lived FTP connection with Python's ftplib allows downloading many more files by reusing the same connection. That download is still serial, however, so in addition I use Python threading to open up to 5 simultaneous connections; any more and I get FTP error 421, too many connections.
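
A minimal sketch of that approach, assuming anonymous login and the directory layout from the FTP URL above (the file list, chunking, and thread count are illustrative, not the gist verbatim):

# Hypothetical sketch: one persistent ftplib connection per worker thread,
# each downloading a chunk of the daily files for one SURFRAD station.
import ftplib
import threading

HOST = 'aftp.cmdl.noaa.gov'
STATION_DIR = '/data/radiation/surfrad/Bondville_IL/1995'
FILENAMES = ['bon95%03d.dat' % (d + 1) for d in range(365)]
N_CONNECTIONS = 5  # NOAA appears to reject more than 5 simultaneous logins

def download_chunk(filenames):
    """Download a list of files over a single, reused FTP connection."""
    with ftplib.FTP(HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(STATION_DIR)
        for fname in filenames:
            with open(fname, 'wb') as f:
                ftp.retrbinary('RETR ' + fname, f.write)

# split the file list into N_CONNECTIONS roughly equal chunks
chunks = [FILENAMES[i::N_CONNECTIONS] for i in range(N_CONNECTIONS)]
threads = [threading.Thread(target=download_chunk, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()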

Describe alternatives you've considered
I was able to open the FTP site directly in Windows Explorer, but that is also a serial connection, so downloading about 10,000 files (roughly 1 GB) would have taken 4 hours. By contrast, using ftplib and threading I can download all of the data for a single site in about 25 minutes.

Additional context
#590
#595
gist of my working script: https://gist.github.com/mikofski/30455056b88a5d161598856cc4eedb2c

@mikofski mikofski added the io label Jan 30, 2021
@mikofski
Member Author

Maybe I should've posted this in the group first? Is there any appetite for this? Better as a script or as a module? Logging okay or less?

@cwhanse
Member

cwhanse commented Feb 1, 2021

I think the existing function is OK as is because SURFRAD publishes daily files. It's not the intent for that function to read a year of data.

I have downloaded years of SURFRAD data using wget. It's not fast, but it's a single command-line statement. Reading it into memory with the pvlib function isn't too bad.

But I can see the utility of having both steps in Python. What about adding a script to the example gallery as a first step? I'm cautious about adding a get_surfrad function, since these retrieval functions are the most troublesome to maintain.

@wholmgren
Member

Following the patterns in some other iotools modules, I'll suggest 1. refactoring the io components out of read_surfrad so that it only performs the parsing and 2. making a new function read_surfrad_from_noaa_ftp(site, start, end) that manages a thread pool. It seems to me that work should be dispatched by the day, not by the year.

I'm leery of adding an example that uses a thread pool since any non-trivial io in the docs seems to eventually cause problems.
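
A rough, untested sketch of that split, assuming a helper that builds the daily URLs (the function names and the site-abbreviation argument are illustrative, not an existing pvlib API):

# Hypothetical sketch: parsing stays in read_surfrad; a new function
# dispatches one download per day to a small thread pool.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pvlib

def _surfrad_url(site, abbrev, date):
    # e.g. ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/1995/bon95001.dat
    return ('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/%s/%d/%s%02d%03d.dat'
            % (site, date.year, abbrev, date.year % 100, date.dayofyear))

def read_surfrad_from_noaa_ftp(site, abbrev, start, end, max_workers=5):
    """Read and concatenate daily SURFRAD files for the range [start, end]."""
    days = pd.date_range(start, end, freq='D')
    urls = [_surfrad_url(site, abbrev, d) for d in days]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        frames = list(pool.map(lambda u: pvlib.iotools.read_surfrad(u)[0], urls))
    return pd.concat(frames)

# e.g. read_surfrad_from_noaa_ftp('Bondville_IL', 'bon', '1995-01-01', '1995-01-16')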

@mikofski
Member Author

mikofski commented Feb 2, 2021

Thanks all!

I think the existing function is OK

I totally agree! After iterating a bit I decided the existing parsing function is fine; I just wanted a faster way to download the raw SURFRAD .dat files. For 7 sites and 25 years of data, and with my last-minute work ethic, waiting 28 hours just wasn't feasible 🤣

It seems to me that work should be dispatched by the day, not by the year.

This might work. It's not the way I started, but it could be more convenient for folks who want a date range, especially within a single year. There's a limit to how many FTP connections NOAA will accept (it seems to be exactly 5), but an existing FTP connection can download many files serially quite quickly. The FTP connection also behaves like a filesystem: I think I could use full paths, but I've been changing directories. So in theory we could open 5 connections, break the date range into 5 chunks, and read them until they're done. That makes a lot of sense, and is probably more straightforward than my approach.

Thanks!

@AdamRJensen
Member

@mikofski Pull request #1254 adds a retrieval function for the monthly data files on the BSRN server. As SURFRAD is part of BSRN, this should offer a much quicker way of getting SURFRAD data, so perhaps this issue can be closed?

It's worth mentioning that the SURFRAD files include some additional data that the BSRN files do not, such as wind speed and direction and a corresponding flag column for each variable.
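
For reference, a hypothetical sketch of the BSRN route (the Bondville station code and the credential placeholders are assumptions; check the get_bsrn documentation for the exact arguments):

# Hypothetical: fetch a month of Bondville data from the BSRN archive.
# BSRN access requires (free) credentials; 'bon' is assumed to be the
# BSRN station code for Bondville.
import pandas as pd
import pvlib

data, metadata = pvlib.iotools.get_bsrn(
    station='bon',
    start=pd.Timestamp('2021-01-01'),
    end=pd.Timestamp('2021-01-31'),
    username='<bsrn-username>',
    password='<bsrn-password>',
)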

@mikofski
Member Author

Let me mull it over. I don't know how much overlap there is, but my gut tells me folks will still want to use the raw SURFRAD iotools.

@wholmgren
Member

It may be worth benchmarking the retrieval speeds for each data source before trying to improve the raw SURFRAD fetch. But removing the existing SURFRAD fetch/read functions is not on the table.

@AdamRJensen
Member

@mikofski As discussed in #1459, SURFRAD files are available both via FTP and, more recently, HTTPS. There seems to be a significant performance gain (at least a factor of two) from using the HTTPS links (see the test below). I figured this might be relevant to this issue.

[screenshot: timing comparison of FTP vs. HTTPS SURFRAD downloads]
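
For completeness, a hypothetical example of pointing the existing reader at the HTTPS server (the gml.noaa.gov path is an assumption based on #1459, and https URLs may require a sufficiently recent pvlib):

# Hypothetical: read one daily file over HTTPS instead of FTP.
# The gml.noaa.gov/aftp path is assumed to mirror the FTP layout.
import pvlib

url = ('https://gml.noaa.gov/aftp/data/radiation/surfrad/'
       'Bondville_IL/2021/bon21001.dat')
data, metadata = pvlib.iotools.read_surfrad(url)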

@mikofski
Member Author

mikofski commented Aug 5, 2022

Wow, that's 3 times faster, but still over a day for 25 years of data. @AdamRJensen, can you ask your contact how many HTTPS connections are allowed from the same host? I still think threading these requests is the way to go, but maybe we leave that to the user?

Any complaints if I close this issue now? I don't think I'll work on it, and the funny thing is you only need to download the SURFRAD data once. Maybe this is better as a gallery example?
