-
Notifications
You must be signed in to change notification settings - Fork 40
Add example to create a virtual dataset using lithops #203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
TomNicholas
merged 11 commits into
zarr-developers:main
from
thodson-usgs:lithops-to-kerchunk-example
Sep 5, 2024
Merged
Changes from all commits
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
e526622
Set ZArray fill_value back to nan
thodson-usgs c21b171
Set NaT as datetime64 default fill value
thodson-usgs 6be97df
Add example to create a virtual dataset using lithops
thodson-usgs 8e301ef
Rename file
thodson-usgs 0a7ee8d
Merge branch 'main' into lithops-to-kerchunk-example
thodson-usgs 520bea2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 2ad483a
Update examples/virtualizarr-with-lithops/README.md
thodson-usgs ab13915
Update examples/virtualizarr-with-lithops/README.md
thodson-usgs cd361d4
Update examples/virtualizarr-with-lithops/README.md
thodson-usgs 3c35bbc
Merge branch 'main' into lithops-to-kerchunk-example
TomNicholas f777ccf
Merge branch 'main' into lithops-to-kerchunk-example
thodson-usgs File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
59 changes: 59 additions & 0 deletions
59
examples/virtualizarr-with-lithops/Dockerfile_virtualizarr
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# Python 3.11 | ||
FROM python:3.11-slim-buster | ||
|
||
|
||
RUN apt-get update \ | ||
# Install aws-lambda-cpp build dependencies | ||
&& apt-get install -y \ | ||
g++ \ | ||
make \ | ||
cmake \ | ||
unzip \ | ||
# cleanup package lists, they are not used anymore in this image | ||
&& rm -rf /var/lib/apt/lists/* \ | ||
&& apt-cache search linux-headers-generic | ||
|
||
ARG FUNCTION_DIR="/function" | ||
|
||
# Copy function code | ||
RUN mkdir -p ${FUNCTION_DIR} | ||
|
||
# Update pip | ||
# NB botocore/boto3 are pinned due to https://github.com/boto/boto3/issues/3648 | ||
# using versions from https://github.com/aio-libs/aiobotocore/blob/72b8dd5d7d4ef2f1a49a0ae0c37b47e5280e2070/setup.py | ||
# due to s3fs dependency | ||
RUN pip install --upgrade --ignore-installed pip wheel six setuptools \ | ||
&& pip install --upgrade --no-cache-dir --ignore-installed \ | ||
awslambdaric \ | ||
botocore==1.29.76 \ | ||
boto3==1.26.76 \ | ||
redis \ | ||
httplib2 \ | ||
requests \ | ||
numpy \ | ||
scipy \ | ||
pandas \ | ||
pika \ | ||
kafka-python \ | ||
cloudpickle \ | ||
ps-mem \ | ||
tblib | ||
|
||
# Set working directory to function root directory | ||
WORKDIR ${FUNCTION_DIR} | ||
|
||
# Add Lithops | ||
COPY lithops_lambda.zip ${FUNCTION_DIR} | ||
RUN unzip lithops_lambda.zip \ | ||
&& rm lithops_lambda.zip \ | ||
&& mkdir handler \ | ||
&& touch handler/__init__.py \ | ||
&& mv entry_point.py handler/ | ||
|
||
# Put your dependencies here, using RUN pip install... or RUN apt install... | ||
|
||
COPY requirements.txt requirements.txt | ||
RUN pip install --no-cache-dir -r requirements.txt | ||
|
||
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ] | ||
CMD [ "handler.entry_point.lambda_handler" ] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Generate a virtual zarr dataset using lithops | ||
|
||
This example walks through how to create a virtual dataset from a collection of | ||
netCDF files on s3 using lithops to open each file in parallel then concatenate | ||
them into a single virtual dataset. | ||
|
||
## Credits | ||
Inspired by Pythia's cookbook: https://projectpythia.org/kerchunk-cookbook | ||
by norlandrhagen. | ||
|
||
Please, contribute improvements. | ||
|
||
|
||
|
||
1. Set up a Python environment | ||
```bash | ||
conda create --name virtualizarr-lithops -y python=3.11 | ||
conda activate virtualizarr-lithops | ||
pip install -r requirements.txt | ||
``` | ||
|
||
2. Configure compute and storage backends for [lithops](https://lithops-cloud.github.io/docs/source/configuration.html). | ||
The configuration in `lithops.yaml` uses AWS Lambda for [compute](https://lithops-cloud.github.io/docs/source/compute_config/aws_lambda.html) and AWS S3 for [storage](https://lithops-cloud.github.io/docs/source/storage_config/aws_s3.html). | ||
To use those backends, simply edit `lithops.yaml` with your `bucket` and `execution_role`. | ||
|
||
1. Build a runtime image for Cubed | ||
```bash | ||
export LITHOPS_CONFIG_FILE=$(pwd)/lithops.yaml | ||
lithops runtime build -b aws_lambda -f Dockerfile_virtualizarr virtualizarr-runtime | ||
``` | ||
|
||
1. Run the script | ||
```bash | ||
python virtualizarr-with-lithops.py | ||
``` | ||
|
||
## Cleaning up | ||
To rebuild the Lithops image, delete the existing one by running | ||
```bash | ||
lithops runtime delete -b aws_lambda -d virtualizarr-runtime | ||
``` |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
lithops: | ||
backend: aws_lambda | ||
storage: aws_s3 | ||
|
||
aws: | ||
region: us-west-2 | ||
|
||
aws_lambda: | ||
execution_role: arn:aws:iam::807615458658:role/lambdaLithopsExecutionRole | ||
runtime: virtualizarr-runtime | ||
runtime_memory: 2000 | ||
|
||
aws_s3: | ||
bucket: arn:aws:s3:::cubed-thodson-temp |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
boto | ||
cftime | ||
h5py | ||
kerchunk | ||
lithops | ||
s3fs | ||
virtualizarr | ||
xarray |
59 changes: 59 additions & 0 deletions
59
examples/virtualizarr-with-lithops/virtualizarr-with-lithops.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# Use lithops to create a virtual dataset from a collection of necdf files on s3. | ||
# | ||
# Inspired by Pythia's cookbook: https://projectpythia.org/kerchunk-cookbook | ||
# by norlandrhagen. | ||
# | ||
# Please, contribute improvements. | ||
|
||
import fsspec | ||
import lithops | ||
import xarray as xr | ||
|
||
from virtualizarr import open_virtual_dataset | ||
|
||
# to demonstrate this workflow, we will use a collection of netcdf files from the WRF-SE-AK-AR5 project. | ||
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True) | ||
files_paths = fs_read.glob("s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/*") | ||
file_pattern = sorted(["s3://" + f for f in files_paths]) | ||
|
||
# optionally, truncate file_pattern while debugging | ||
# file_pattern = file_pattern[:4] | ||
|
||
print(f"{len(file_pattern)} file paths were retrieved.") | ||
|
||
|
||
def map_references(fil): | ||
"""Map function to open virtual datasets.""" | ||
vds = open_virtual_dataset( | ||
fil, | ||
indexes={}, | ||
loadable_variables=["Time"], | ||
cftime_variables=["Time"], | ||
) | ||
return vds | ||
|
||
|
||
def reduce_references(results): | ||
"""Reduce to concat virtual datasets.""" | ||
combined_vds = xr.combine_nested( | ||
results, | ||
concat_dim=["Time"], | ||
coords="minimal", | ||
compat="override", | ||
) | ||
return combined_vds | ||
|
||
|
||
fexec = lithops.FunctionExecutor(config_file="lithops.yaml") | ||
|
||
futures = fexec.map_reduce( | ||
map_references, | ||
file_pattern, | ||
reduce_references, | ||
spawn_reducer=100, | ||
) | ||
|
||
ds = futures.get_result() | ||
|
||
# write out the virtual dataset to a kerchunk json | ||
ds.virtualize.to_kerchunk("combined.json", format="json") |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason not to use
file
instead offil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must've copied that directly from an example. I'll need to check whether this follows some convention or just a typo.