Description
There are rare cases in the dataset where the data_min and data_max columns in the catalog don't match the min/max measured from the actual (decoded) images.
For example, consider event R19011212048075 for img_type='ir069'. Its entry in CATALOG.csv is:
id R19011212048075
file_name ir069/2019/SEVIR_IR069_RANDOMEVENTS_2019_0101_...
file_index 821
img_type ir069
time_utc 2019-01-12 12:00:00
minute_offsets -120:-115:-110:-105:-100:-95:-90:-85:-80:-75:-...
episode_id NaN
event_id NaN
event_type NaN
llcrnrlat 38.9436
llcrnrlon -92.3178
urcrnrlat 42.0725
urcrnrlon -87.3715
proj +proj=laea +lat_0=38 +lon_0=-98 +units=m +a=63...
size_x 192
size_y 192
height_m 384000
width_m 384000
!data_min -23540.1
!data_max 22.877
pct_missing 0
Name: 39505, dtype: object

The minimum value in this case is -23540.1 degrees C, which is a strange value. And if we actually look at the minimum of the image stored in SEVIR, we see a value of -18312, which decodes to -183.12. That's different from what's reported above.
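A quick way to flag entries like this one is to compare the catalog values against the decoded image. The snippet below is a minimal sketch, not SEVIR's own tooling: the function name `catalog_disagrees`, the tolerance, and the tiny synthetic array are made up for illustration, and the 0.01 scale factor is inferred from the decoding above (-18312 decodes to -183.12).

```python
import numpy as np

def catalog_disagrees(raw, data_min, data_max, scale=0.01, tol=0.01):
    """Return True if the catalog min/max differ from the decoded image.

    raw      : integer array as stored in the .h5 file
    scale    : assumed decoding factor (stored value * scale = degrees C)
    tol      : hypothetical tolerance, to absorb rounding in the catalog
    """
    decoded = raw.astype(np.float64) * scale
    min_off = abs(decoded.min() - data_min) > tol
    max_off = abs(decoded.max() - data_max) > tol
    return bool(min_off or max_off)

# Synthetic stand-in for a stored image (not real SEVIR data).
raw = np.array([[-18312, 2288], [0, 100]], dtype=np.int16)

print(catalog_disagrees(raw, data_min=-23540.1, data_max=22.877))  # catalog values from this issue
print(catalog_disagrees(raw, data_min=-183.12, data_max=22.877))   # values consistent with the image
```

With the catalog values from this issue the check reports a mismatch; with a minimum that matches the stored image it does not.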
Explanation
Looking at the data, this happens when there are a few bad pixels in the image, typically in very high and thick clouds.
Data is converted to int16 before being written to .h5; however, the min/max values entered in the CATALOG are recorded before this cast is done. In cases of bad pixels, these values get very large in magnitude (as happened in this case), and the true minimum of the data causes an int16 overflow when scaled. So the pixel value stored for these bad pixels in SEVIR is garbage (as is the value stored in the CATALOG).
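To make the overflow concrete, here is a small sketch of what happens when a value far outside the int16 range is scaled and narrowed. It assumes a scale factor of 100 (inferred from -18312 decoding to -183.12) and uses a hypothetical helper, `to_int16_wrapping`; the actual casting code used to build SEVIR may behave differently. The helper goes through int64 first because casting an out-of-range float directly to int16 gives platform-dependent results.

```python
import numpy as np

SCALE = 100  # assumed: stored integer = value in degrees C * 100

def to_int16_wrapping(values_c, scale=SCALE):
    """Scale to integers, then narrow to int16 with wraparound."""
    scaled = np.rint(np.asarray(values_c, dtype=np.float64) * scale)
    return scaled.astype(np.int64).astype(np.int16)

# Range of temperatures that int16 can represent at this scale.
info = np.iinfo(np.int16)
print(info.min / SCALE, info.max / SCALE)  # -327.68 327.67

# An in-range value survives; the catalog minimum from this issue does not.
stored = to_int16_wrapping([-183.12, -23540.1])
print(stored)  # [-18312   5286]
```

Anything outside roughly ±327 degrees C cannot be stored at this scale, so the wrapped value bears no relation to the original measurement.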
Unfortunately, this cannot be fixed easily without recreating the whole dataset. A good practice in preprocessing is to clip pixels to a physically reasonable range, computed after filtering out outliers like this one.
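One possible way to do that clipping is sketched below. This is only an illustration under stated assumptions: `clip_outliers` is a hypothetical helper, the 0.1 and 99.9 percentile cutoffs are arbitrary choices, and the image is synthetic. Fixed bounds chosen from domain knowledge would work just as well with `np.clip`.

```python
import numpy as np

def clip_outliers(decoded, lower_pct=0.1, upper_pct=99.9):
    """Clip decoded pixel values to a percentile-based range.

    The percentiles are assumptions; tune them (or replace them with
    fixed physical bounds) for the image type at hand.
    """
    lo, hi = np.percentile(decoded, [lower_pct, upper_pct])
    return np.clip(decoded, lo, hi), (lo, hi)

# Synthetic 192x192 image in degrees C with one bad pixel.
rng = np.random.default_rng(0)
img = rng.uniform(-80.0, 20.0, size=(192, 192))
img[0, 0] = -183.12

clipped, (lo, hi) = clip_outliers(img)
print(img.min(), clipped.min())
```

For a whole dataset it is probably better to estimate the bounds once from a sample of many images and reuse them, so that every image is clipped to the same range.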
