Data structure(s) for PV measurement datasets such as I-V curves #469
@thunderfish24 I support this effort. FYI - we're planning a more formal taxonomy development as part of Orange Button that has these data within its scope, but that's a year out. Some thoughts about python structure:
Xarrays might work, but I would suggest a simple 2D table (in pandas) with columns such as those sketched below. You can then use query to extract subsets and/or reindex/pivot/unstack to rearrange the data for analysis. I tend to see units as metadata. (The suggestion above is for internal python storage/data structures to facilitate analysis, which the question seems to be focusing on. External/file storage options would ideally not be python-specific.)
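For illustration, a minimal sketch of such a table, with invented column names (one row per measured point):

import pandas as pd

# one row per measured (V, I) point; column names here are made up
df = pd.DataFrame({
    'f_nom':  [0.2, 0.2, 0.2, 1.0, 1.0, 1.0],   # nominal effective irradiance (suns)
    't_degC': [15, 15, 15, 25, 25, 25],          # nominal module temperature
    'v_V':    [0.0, 20.0, 33.0, 0.0, 22.0, 34.0],
    'i_A':    [1.02, 1.00, 0.00, 5.10, 5.00, 0.00],
})

# query to extract a subset (one curve)
curve = df.query('f_nom == 0.2 and t_degC == 15')

# pivot to rearrange for analysis, e.g. max current in each (F, T) cell
i_max = df.pivot_table(index='f_nom', columns='t_degC', values='i_A', aggfunc='max')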
netCDF might be a good option here. This may be complicated enough, with enough different use cases, that it's worth adopting a standard, well-supported framework instead of developing our own. In some of our research projects, trying to apply our half-baked data models to the netCDF format revealed gaps in how we were thinking about the problem. You could wrap the netCDF data in a custom class, if desired.
Thanks everyone for the thoughtful responses! After looking a bit further, I would say that netCDF has the most potential and the best "alignment" to this particular dataset that I have in mind. (pandas' DataFrame still seems less well aligned to me, but it's probably doable.) With netCDF, the data formatting and storage would be more standardized than what I suggested, and I agree that units are best kept in the metadata and out of the data's indexes. It seems that the standardization part with netCDF involves coming up with "conventions", analogous to the CF conventions and their units, that cover the various PV performance measurement/monitoring cases. (Perhaps one aligns the conventions with the eventual Orange Button taxonomy?) The devil might be in the details, though, so I am going to report back after I try to actually implement netCDF for my particular data at hand. Regarding some of @cwhanse's more specific comments:
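As a rough sketch of how this might look via xarray's netCDF support (the variable names and the fixed-length-curve assumption are mine, not any convention):

import numpy as np
import xarray as xr

# fake data: one fixed-length I-V sweep per (f_nom, t_nom) grid cell
f_nom, t_nom, n_pts = [0.2, 1.0], [15, 25], 5
volts = np.random.rand(2, 2, n_pts)
amps = np.random.rand(2, 2, n_pts)

ds = xr.Dataset(
    {'voltage': (('f_nom', 't_nom', 'point'), volts),
     'current': (('f_nom', 't_nom', 'point'), amps)},
    coords={'f_nom': f_nom, 't_nom': t_nom},
)
# units live in the metadata (attrs), not in the data or its indexes
ds['voltage'].attrs['units'] = 'V'
ds['current'].attrs['units'] = 'A'
ds['t_nom'].attrs['units'] = 'degC'

ds.to_netcdf('iv_curves.nc')           # write netCDF (needs netCDF4 or scipy backend)
ds2 = xr.open_dataset('iv_curves.nc')  # round-trip it back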
I forgot to mention CF conventions! Developing similar conventions for PV data in netCDF format could be a high-impact project.
What's a CF convention?
@mikofski Do you have any insight on how we might integrate, for example, netCDF files with pvfree? (https://pvfree.alwaysdata.net/ is down ATM, so I couldn't look at the user interface to the existing data.)
Responding to: #469 (comment)
Actually this nomenclature is not suitable because of the Greek letters, subscripts and commas. The newer version of this standard (2017) is no better, nor is the variable list on the PVPMC website. Or maybe I misinterpreted your suggestion?
@adriesse I had forgotten about that list at https://pvpmc.sandia.gov/resources-and-events/variable-list/. We compiled that list consistent with the notation on https://pvpmc.sandia.gov and the PVLib for MATLAB toolbox. But it's also consistent with notation generally used in PV-related literature, e.g., alpha for the temperature coefficient of current. Do you see the spelled-out Greek letters as an issue for column naming?
Not a fundamental issue. But there aren't that many of them and they get reused a lot. Typesetting and programming have different needs and constraints for notation.
We also keep this related list in our own documentation: http://pvlib-python.readthedocs.io/en/latest/variables_style_rules.html I'd like to see more names fully spelled out, and fewer abbreviations and spelled-out Greek letters. I'd vote for that in column/variable naming conventions for a data structure as well.
Hi @thunderfish24, sorry all, unfortunately some of my GitHub notifications have been going to spam. Here's an example of querying pvfree's REST API for module and inverter parameters:
>>> import requests
>>> r = requests.get('https://pvfree.herokuapp.com/api/v1/pvmodule/?format=json&Name__icontains=Canadian%20Solar')
>>> import pprint
>>> pprint.pprint(r.json())
{'meta': {'limit': 20,
'next': None,
'offset': 0,
'previous': None,
'total_count': 2},
'objects': [{'A': -3.40641,
'A0': 0.928385,
'A1': 0.068093,
'A2': -0.0157738,
'A3': 0.0016606,
'A4': -6.93e-05,
'Aimp': 0.000181,
'Aisc': 0.000397,
'Area': 1.701,
'B': -0.0842075,
'B0': 1.0,
'B1': -0.002438,
'B2': 0.0003103,
'B3': -1.246e-05,
'B4': 2.11e-07,
'B5': -1.36e-09,
'Bvmpo': -0.235488,
'Bvoco': -0.21696,
'C0': 1.01284,
'C1': -0.0128398,
'C2': 0.279317,
'C3': -7.24463,
'C4': 0.996446,
'C5': 0.003554,
'C6': 1.15535,
'C7': -0.155353,
'Cells_in_Series': 96,
'DTC': 3.0,
'FD': 1.0,
'IXO': 4.97599,
'IXXO': 3.18803,
'Impo': 4.54629,
'Isco': 5.09115,
'Material': 10,
'Mbvmp': 0.0,
'Mbvoc': 0.0,
'N': 1.4032,
'Name': 'Canadian Solar CS5P-220M [ 2009]',
'Notes': 'Source: Sandia National Laboratories Updated 9/25/2012 '
'Module Database',
'Parallel_Strings': 1,
'Vintage': '2009-01-01',
'Vmpo': 48.3156,
'Voco': 59.2608,
'id': 114,
'is_vintage_estimated': False,
'resource_uri': '/api/v1/pvmodule/114/'},
{'A': -3.6024,
'A0': 0.9371,
'A1': 0.06262,
'A2': -0.01667,
'A3': 0.002168,
'A4': -0.0001087,
'Aimp': -0.0001,
'Aisc': 0.0005,
'Area': 1.91,
'B': -0.2106,
'B0': 1.0,
'B1': -0.00789,
'B2': 0.0008656,
'B3': -3.298e-05,
'B4': 5.178e-07,
'B5': -2.918e-09,
'Bvmpo': -0.1634,
'Bvoco': -0.1532,
'C0': 1.0121,
'C1': -0.0121,
'C2': -0.171,
'C3': -9.397451,
'C4': None,
'C5': None,
'C6': None,
'C7': None,
'Cells_in_Series': 72,
'DTC': 3.2,
'FD': 1.0,
'IXO': None,
'IXXO': None,
'Impo': 8.1359,
'Isco': 8.6388,
'Material': 10,
'Mbvmp': 0.0,
'Mbvoc': 0.0,
'N': 1.0025,
'Name': 'Canadian Solar CS6X-300M [2013]',
'Notes': 'Source: CFV Solar Test Lab. Tested 2013. Module '
'13022-08',
'Parallel_Strings': 1,
'Vintage': '2013-01-01',
'Vmpo': 34.9531,
'Voco': 43.5918,
'id': 518,
'is_vintage_estimated': False,
'resource_uri': '/api/v1/pvmodule/518/'}]}

and

>>> r = requests.get('https://pvfree.herokuapp.com/api/v1/pvinverter/?format=json&Name__icontains=PVP&Paco__exact=260000')
>>> pprint.pprint(r.json())
{'meta': {'limit': 20,
'next': None,
'offset': 0,
'previous': None,
'total_count': 4},
'objects': [{'C0': -1.07933e-07,
'C1': 1.88514e-05,
'C2': 0.00151279,
'C3': -0.000697514,
'Idcmax': 791.29,
'Mppt_high': 480.0,
'Mppt_low': 295.0,
'Name': 'PV Powered: PVP260KW [480V] 480V [CEC 2018]',
'Paco': 260000.0,
'Pdco': 269830.0,
'Pnt': 67.0,
'Pso': 1006.34,
'Vac': 480.0,
'Vdcmax': 480.0,
'Vdco': 341.0,
'created_on': '2018-05-09',
'id': 3847,
'modified_on': '2018-05-09',
'resource_uri': '/api/v1/pvinverter/3847/'},
{'C0': -1.35676e-07,
'C1': 2.54289e-05,
'C2': 0.00206057,
'C3': -0.000253737,
'Idcmax': 849.99,
'Mppt_high': 480.0,
'Mppt_low': 265.0,
'Name': 'PV Powered: PVP260KW-LV [480V] 480V [CEC 2018]',
'Paco': 260000.0,
'Pdco': 271147.0,
'Pnt': 67.0,
'Pso': 1086.2,
'Vac': 480.0,
'Vdcmax': 480.0,
'Vdco': 319.0,
'created_on': '2018-05-09',
'id': 3848,
'modified_on': '2018-05-09',
'resource_uri': '/api/v1/pvinverter/3848/'},
{'C0': -1.03e-07,
'C1': 2.05e-05,
'C2': 0.00203,
'C3': -0.000443,
'Idcmax': 925.0,
'Mppt_high': 600.0,
'Mppt_low': 295.0,
'Name': 'PV Powered: PVP260kW 480V [CEC 2009]',
'Paco': 260000.0,
'Pdco': 270057.3609,
'Pnt': 67.0,
'Pso': 893.1837948,
'Vac': 480.0,
'Vdcmax': 600.0,
'Vdco': 343.3983333,
'created_on': '2018-05-09',
'id': 3849,
'modified_on': '2018-05-09',
'resource_uri': '/api/v1/pvinverter/3849/'},
{'C0': -1.33e-07,
'C1': 2.79e-05,
'C2': 0.00273,
'C3': 0.000131,
'Idcmax': 1030.0,
'Mppt_high': 600.0,
'Mppt_low': 265.0,
'Name': 'PV Powered: PVP260kW-LV 480V [CEC 2009]',
'Paco': 260000.0,
'Pdco': 271537.9777,
'Pnt': 67.0,
'Pso': 929.7589628,
'Vac': 480.0,
'Vdcmax': 600.0,
'Vdco': 322.2183333,
'created_on': '2018-05-09',
'id': 3850,
'modified_on': '2018-05-09',
'resource_uri': '/api/v1/pvinverter/3850/'}]}
NumPy structured arrays are another option:

import numpy as np
import pvlib
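# two synthetic I-V curves, 100 points each, from single-diode parameters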
x = pvlib.pvsystem.singlediode(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
y = pvlib.pvsystem.singlediode(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
my_dtype = np.dtype([
('i_l', float), ('i_0', float), ('r_s', float), ('r_sh', float), ('nNsVth', float),
('i_sc', float), ('v_oc', float), ('i_mp', float), ('v_mp', float),
('i', float, (1,100)), ('v', float, (1, 100))
])
my_data = np.array([
(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
x['i_sc'], x['v_oc'], x['i_mp'], x['v_mp'], x['i'], x['v']),
(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
y['i_sc'], y['v_oc'], y['i_mp'], y['v_mp'], y['i'], y['v'])
], my_dtype)
my_data['i_l'] # list of all photogenerated currents (`I_L`)
# array([6.1, 5.1])
my_data['i'][0]
# list of cell currents for first record

I'm not endorsing this, just making sure you all are aware of it. But note how you can make the cell current and voltage fields 1x100 since we know we'll always get 100 points per curve.

# pretend this copy is a second set of curves, to build a fake (E, T) grid
your_data = np.copy(my_data)
# reshape my_data and your_data from (2,) to (1, 2), and concatenate
# you could also use np.atleast_2d or np.tile, probably ...
# lots of options here, not sure which is best ...
all_data = np.concatenate([my_data.reshape(1,2), your_data.reshape(1,2)], axis=0)
all_data.shape
# (2, 2)
# now "fancy" indexing to get i-v curve at (E, T)
all_data[1, 1][['i', 'v']]
([[5.09950248e+00, 5.09674358e+00, 5.09398468e+00, 5.09122577e+00, 5.08846685e+00, ...]],
[[ 0.        , 0.339375  , 0.67875   , 1.01812499, 1.35749999, ...]])

then plot:

import matplotlib.pyplot as plt
plt.ion()
v, i = all_data[1, 1][['v', 'i']]
plt.plot(v.flat, i.flat)
plt.grid()
plt.title('i-v curve at (E, T) from NumPy structured arrays')
plt.xlabel('voltage, V')
plt.ylabel('current, I')

You could easily plot families of curves this same way.
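For example, a sketch looping over the fake (E, T) grid from above (the labels are ad hoc):

# plot every curve in the (E, T) grid on one axes
for m in range(all_data.shape[0]):
    for n in range(all_data.shape[1]):
        v, i = all_data[m, n][['v', 'i']]
        plt.plot(v.flat, i.flat, label='cell (%d, %d)' % (m, n))
plt.legend()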
@adriesse It turns out pandas' multi-indexing DataFrame is a "natural" solution (at least for my way of thinking), as well as mapping 1-1 to the underlying spreadsheet template that we are using for data collection. Below is an example with a matrix of fake I-V-F-H curves, each with 10 points. For curves with differing numbers of points, some "trailing" missing values would be NaN and would need to be accommodated, which could lead to significant wasted memory in some datasets. This also doesn't address the question of how best to store the dataframe.

import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([['0.1', '0.2', '0.4', '0.6', '0.8', '1.0', '1.1'],
['15', '25', '50'],
['v_V', 'i_A', 'f', 'h']],
names=['f_nom', 't_degC_nom', 'channel'])
df = pd.DataFrame(np.random.randn(index.size, 10), index=index)
print(df.loc[(['0.8', '1.0', '1.1'], ['15', '25', '50'], ['v_V', 'i_A']), ::2])

This selects the voltage and current channels at the three highest irradiance levels and all three temperatures, keeping every second point.
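A single channel of a single curve can also be pulled out by its full key, and trailing NaNs dropped for the varying-length case (a sketch using the index levels defined above):

# currents for the curve at f_nom=0.2, 15 degC
i_curve = df.loc[('0.2', '15', 'i_A')]

# with curves of differing lengths, drop trailing NaNs per curve
i_curve = i_curve.dropna()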
As I mentioned, you can pivot 2-D data into other forms such as the one you show (see https://pandas.pydata.org/pandas-docs/stable/reshaping.html) to facilitate analysis or visualization. I sometimes do that, but other times I just do a query to get the subset I want, which I find more straightforward than the multi-index syntax, though probably less efficient.
Sorry if my previous comment was too long and meandering.
I should add that serializing and deserializing my example with h5py is trivial:

import numpy as np
import pvlib
import h5py
# create some data
x = pvlib.pvsystem.singlediode(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
y = pvlib.pvsystem.singlediode(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026, 100)
# set the dtypes to use as a structured array
my_dtype = np.dtype([
('i_l', float), ('i_0', float), ('r_s', float), ('r_sh', float), ('nNsVth', float),
('i_sc', float), ('v_oc', float), ('i_mp', float), ('v_mp', float),
('i', float, (1,100)), ('v', float, (1, 100))
])
# store the data in a structured array; note that the I-V curve fields are nested (1, 100) subarrays
my_data = np.array([
(6.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
x['i_sc'], x['v_oc'], x['i_mp'], x['v_mp'], x['i'], x['v']),
(5.1, 1.2e-7, 0.012, 123, 1.23*60*0.026,
y['i_sc'], y['v_oc'], y['i_mp'], y['v_mp'], y['i'], y['v'])
], my_dtype)
# pretend that this is a grid of IV curves for a matrix of (E, T)
your_data = np.copy(my_data)
# reshape my_data and your_data from (2,) to (1, 2),
# and concatenate to make fake grid
all_data = np.concatenate([my_data.reshape(1,2), your_data.reshape(1,2)], axis=0)
# output to a file
with h5py.File('THIS_IS_A_TEST_FILE.H5', 'w') as f:
    f['data'] = all_data  # key "data" is arbitrary, choose as many groups as you need

Quit python and restart:

import h5py
import numpy as np
# retrieve the data from file
with h5py.File('THIS_IS_A_TEST_FILE.H5', 'r') as f:
    all_data = np.array(f['data'])
# do some fancy indexing:
all_data[1,1][['i', 'v']]
# ([[5.09950248e+00, 5.09674358e+00, ..., 1.41340468e+00, 7.65749212e-01, 7.99360578e-15]],
#  [[ 0.        , 0.339375  , 0.67875   , ..., 32.91937481, 33.2587498 , 33.5981248 ]])

Or use record arrays instead of structured:

all_data_rec = np.rec.array(all_data)  # as record array
all_data_rec.i_l
# array([[6.1, 5.1],
#        [6.1, 5.1]])

AFAICT the only difference between structured and record arrays is the ability to use attributes for column names instead of fields.
@mikofski what might change if we want to store multiple IV curves, and the ...
Are we still (or were we ever) discussing a pvlib enhancement? If not, let's at least close the issue, if not move it elsewhere.
At the moment, the discussion is relevant to the demonstration data for #229 and possibly to whatever we do with #511. I'm OK closing this as an issue, and taking up the discussion when we have a specific implementation to review. I'd rather see a pull request targeting ...
@mikofski You have convinced me to take a closer look at numpy's structured/record arrays :). The alternative that I choose (pandas vs. numpy) will mostly depend on which "feels" more lightweight and natural in terms of things like complex slicing, concatenation, dealing with I-V curves of different lengths, and handling repeated measurements. Oh, did I mention that I also have normal-incidence QEs at three temperatures for this dataset? I don't see any big issues saving either alternative to HDF5, but I do need to further investigate the storage of metadata such as channel units, as well as settle upon the names (and maybe a standards effort would ultimately prefer netCDF with PV-specific "conventions"). Finally, do you know if it makes sense to transfer the HDF5 over the wire for a REST API, or would you anticipate a server-side JSON conversion?

@adriesse Pandas' pivoting is impressive; thanks for bringing that tool to my attention. I'm hoping that the "raw" data structure can be organized (at least for the IEC 61853-1 use case) such that it could be readily "understood" by a human who loads it out of storage and displays the data object for the first time, and it seems like the multi-index setup accomplishes that well.

@wholmgren I will close this issue now, but @cwhanse please reference this use case as the Orange Button initiative gets underway.
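One possibility for storing channel units is HDF5 attributes on the dataset; a sketch reusing the file from the h5py example above (the attribute names are ad hoc, not any convention):

import h5py

with h5py.File('THIS_IS_A_TEST_FILE.H5', 'a') as f:
    # attach units as attributes; names/values are illustrative only
    f['data'].attrs['i_units'] = 'A'
    f['data'].attrs['v_units'] = 'V'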
I'm looking for input on using/creating a "standard" data structure to store PV measurement datasets, such as I-V curves supporting IEC 61853-1. I'm thinking of something flexible/extensible and self-documenting (esp. w.r.t. units). This space also seems to intersect with time-series of I-V curves and maybe PECOS workflows.
For example, I have a collection of I-V-F-T curves, each with a possibly varying number of points, each taken at a point on a "nominal" matrix of effective irradiance F = 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.1 (unitless) and temperature T = 15, 25, 50 degC. Sticking to just python and numpy (pandas doesn't seem like the right fit here), I came up with this dict-based structure:
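Something along these lines (a sketch; the channel names beyond 'i_A' are illustrative):

import numpy as np

# keys are (nominal irradiance, nominal temperature) string pairs;
# values hold one measured curve per key
data = {
    ('0.2', '15 degC'): {
        'v_V': np.array([0.0, 20.0, 33.0]),        # voltages for this curve
        'i_A': np.array([1.02, 1.00, 0.00]),       # currents for this curve
    },
    ('1.0', '25 degC'): {
        'v_V': np.array([0.0, 22.0, 30.0, 34.0]),  # curves may differ in length
        'i_A': np.array([5.10, 5.00, 3.20, 0.00]),
    },
}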
In this case, I could retrieve the currents vector for a particular curve using data[('0.2', '15 degC')]['i_A']. I also need to concatenate (in a consistent order) all the currents, voltages, etc. from all the curves together. One could also imagine repeated I-V-F-T curve measurements at each nominal setting (with possibly a different number of points in each repetition). The ordered-pair keys can also be sorted in various ways using sorted(), as long as the chosen strings don't cause ordering problems. Note that replacing the keys with timestamps would produce time-series I-V-F-T curve data.
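A sketch of that concatenation, using sorted keys for a consistent order across curves:

# concatenate all currents (and voltages) across curves, in sorted-key order
all_i = np.concatenate([data[k]['i_A'] for k in sorted(data)])
all_v = np.concatenate([data[k]['v_V'] for k in sorted(data)])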