Skip to content

Clean up how pvlib imports optional dependencies #1283

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

kandersolar
Copy link
Member

@kandersolar kandersolar commented Aug 16, 2021

  • [ ] Closes #xxxx
  • I am familiar with the contributing guidelines
  • Tests added
  • Updates entries to docs/sphinx/source/api.rst for API changes.
  • Adds description and name entries in the appropriate "what's new" file in docs/sphinx/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
  • Pull request is nearly complete and ready for detailed review.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

I'd have made an issue for this but I thought a PR would make it easier to see the idea in action.

pvlib has multiple ways of handling optional imports. One is to hide the imports inside the function that needs them, e.g. how solarposition.py handles the ephem dependency:

try:
import ephem
except ImportError:
raise ImportError('PyEphem must be installed')

Another way used in ecmwf_macc.py and #1264, is to keep the import at the top of the file for clarity, but defer an import error until the missing module is actually called on to do something, like this:

try:
import netCDF4
except ImportError:
class netCDF4:
@staticmethod
def Dataset(*a, **kw):
raise ImportError(
'Reading ECMWF data requires netCDF4 to be installed.')

So only when some bit of code actually tries to create a netcdf4.Dataset will the ImportError be raised.

Summarizing some of @AdamRJensen's thoughts here: the first is simple but adds a bit of clutter to the functions (solarposition.py imports ephem four separate times) and doesn't make it easy to see what external packages each pvlib module uses. The second keeps imports at the top of the file but is a little opaque and gets clunky if more than one function is needed from a single import.

This PR extends the second approach, but with improvements:

  • use __getattr__ so there's no need to explicitly list out the module functions of interest. For example the ERA5 PR wouldn't have to handle open_dataset and open_mfdataset separately like this
  • less boilerplate in the modules because the machinery is defined in pvlib.tools, so each module doesn't need to define a new class for each import

What do people think? Too "clever"? Solving a problem that doesn't need solved? There might be other places to use this too, but I only looked at ecwmf_macc.py and solarposition.py for now to illustrate its usage.

@wholmgren
Copy link
Member

Too "clever"? Solving a problem that doesn't need solved?

Sorry to be a downer, but I'd say yes and yes. I also understand that others may reasonably disagree.

and doesn't make it easy to see what external packages each pvlib module uses.

This could be solved with comments near the external imports at the top of the file e.g. # some functions import the optional dependency ephem

@AdamRJensen
Copy link
Member

AdamRJensen commented Aug 17, 2021

I agreed that it is quite clever/complex, but I like the fact that the 'cleverness' is contained in one function that most developers don't necessarily need to understand (like me). However, the current solution requires each function to have some slightly less clever method (e.g, as with get_ecmwf_macc) which is less useful as it only covers the specific module used in the library (and not all instances).

As a newbie contributor to pvlib, I would have preferred to add from pvlib.tools import _optional_import and use _optional_import than the 13 lines I ended up writing in #1264 instead.

What I also quite like about the solution is that all module imports are kept at the top (which is where I go to look for dependencies) instead of numerous times throughout the function as in the case of solarposition.py.

Just a few thoughts from this side of the pond :)

@mikofski
Copy link
Member

mikofski commented Aug 18, 2021

Even tho I once thought it better for all imports at top, I agree with Will that the mocking of functions in try: except at module top is too clever and IMHO may have some downsides:

  • I haven't seen a standard, or a pattern in other libraries. This means we'd be breaking new ground
  • in the past I've had some bugs b/c of these, specifically in singldiode.py
  • mocking imported modules may delay execution, but IMO it's better to fail ASAP when a function with optional import dependency is called
  • as a rule I believe monkeypatching is an anti pattern that is discouraged

An alternate solution which might be slightly less clever would be a decorator & still commenting at module top:

from pvlib.tools import import_optional_dependency
# has optional dependency import

@import_optional_dependency("foobar")
def myfun():
    return


# in tools, untested psuedocode
import imp  # or importlib whatever
def import_optional_dependency(imports):
    def wrapper(f):
        def wrapt_fcn(*args, **kwargs):
            imp.import(imports)
                return f(*args, **kwargs)
        return wrapt_fcn
# end wrapper?

A cool side effect is that if you see a module that imports this tool then you know already the are optional dependencies imported

@cwhanse
Copy link
Member

cwhanse commented Aug 18, 2021

I'm going to inject a related question: can we move tables off the optional list? Its required to run the default clearsky model, and, it's a pain to diagnose for users when their python environment can't find it.

@wholmgren
Copy link
Member

Some research on what other projects do:

pandas has a private module _optional.py that we could copy (with license attribution). Check out the git blame for all of the thought that has gone into it - that's part of why I want to keep it simple or at least avoid reinventing wheels. Note that pandas is fine with putting imports within functions e.g. lxml = import_optional_dependency("lxml.etree", errors="ignore"). They also track installed state for some optional dependencies: from pandas.core.computation.check import NUMEXPR_INSTALLED

siphon puts try/except ImportError statements within methods.

So does dask

I think this approach is also reasonable, while this one is overkill.

I agreed that it is quite clever/complex, but I like the fact that the 'cleverness' is contained in one function that most developers don't necessarily need to understand (like me).

Developer considerations are important, but multiple maintainers do need to understand it and I just don't have it in me at the moment.

@kandersolar
Copy link
Member Author

Fine with me to KISS and just use try/except ImportError inside the function like that siphon example and our ephem code.

What I also quite like about the solution is that all module imports are kept at the top (which is where I go to look for dependencies)

I was going to say that this is not a compelling point to me personally, but maybe that's because I already have a decent sense for what packages are used in each pvlib module. Or maybe I would just go to setup.py to see dependencies instead of reading module headers. Anyway would this be addressed by including a comment next to the required imports like @wholmgren suggested?

@AdamRJensen
Copy link
Member

Sounds good to me. I was never super attached to the new strategy, but figured it was worth discussing.

@AdamRJensen
Copy link
Member

AdamRJensen commented Aug 19, 2021

So for functions relying on a non-required package (e.g., xarray in get_era5 #1264) would it look something like this immediately after the function docstring:

try:
    import xarray as xr
except ImportError:
    raise ImportError('Reading ERA5 data requires xarray to be installed.')

Of course for modules where it's usage depends on the input parameters it would be moved to one or more suitable places in the code.

Also, is there a reason not to use the subclass ModuleNotFoundError or are we interested in catching all ImportErrors (the message "requires xarary to be installed" implies that it wasn't found and not that it couldn't be loaded)?

@kandersolar
Copy link
Member Author

It's not clear to me when ModuleNotFoundError is more appropriate than ImportError, and googling didn't really help. I looked at what some other packages use in this situation and found a mix:

In the absence of any better information I'd say raise ImportError since that's what the rest of pvlib already does.

@mikofski
Copy link
Member

from zen of python:

Simple is better than complex.

and

In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.

which I interpret to mean, try to observe the most-accepted and widely used methods because:

Readability counts.
Special cases aren't special enough to break the rules.

@kandersolar
Copy link
Member Author

Thanks all for the discussion, I think we can close this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants