**File:** docs/how-to/scientific-data/landscape-guide.md (+20/-20)
---
title: Scientific data and IPFS landscape guide
description: an overview of the problem space, available tools, and architectural patterns for publishing and working with scientific data using IPFS.
---

# Scientific data and IPFS landscape guide
Scientific data and IPFS are naturally aligned: research teams need to share large datasets across institutions, verify data integrity, and ensure resilient access. From sensor networks to global climate modeling efforts, scientific communities are using IPFS content addressing and peer-to-peer distribution to solve problems traditional infrastructure can't.
In this guide, you'll find an overview of the problem space, available tools, and architectural patterns for publishing and working with scientific data using IPFS.
## A landscape in flux
Science advances through collaboration, yet the infrastructure for sharing scientific data has historically developed in silos. Different fields adopted different formats, metadata conventions, and distribution mechanisms.
This fragmentation means there is no single "right way" to publish and share scientific data. Instead, this is an area of active innovation, with new tools and conventions emerging as communities identify common needs. Standards like [Zarr](https://zarr.dev) represent convergence points where different fields have found common ground.
This guide surveys the landscape and available tooling, but the right approach for your project depends on your specific constraints: the size and structure of your data, your collaboration patterns, your existing infrastructure, and your community's conventions. The goal is to help you understand the options so you can make informed choices.
## The nature of scientific data
Scientific data originates from a variety of sources. In the geospatial field, data is collected by sensors, measuring instruments, camera systems, and satellites. This data is commonly structured as multidimensional arrays (tensors), representing measurements across dimensions like time, latitude, longitude, and altitude.
Key characteristics of scientific data include:
- **Metadata-rich**: Extensive contextual information accompanies the raw measurements
- **Collaborative**: Research often involves multiple institutions and scientists sharing and building upon datasets
## The importance of open data access
As hinted above, open access to scientific data accelerates research, enables reproducibility, and maximizes the return on public investment in science. Organizations worldwide have recognized this, leading to mandates for open data sharing in publicly funded research.
These criteria are by no means exhaustive; initiatives like FAIR, for example, define further principles for open data sharing.
With that in mind, the next section will look at how these ideas come together with IPFS.
## The benefits of IPFS for scientific data
IPFS addresses several pain points in scientific data distribution:
To get a better sense of how these ideas, central to IPFS's design, are applied by the scientific community, it's worth looking at the [ORCESTRA campaign case study](../../case-studies/orcestra.md), which uses IPFS to reap these benefits.
## Architectural patterns
### CID-centric verifiable data management
Ultimately, the choice between these approaches to content-addressed data management depends on questions like:
- How important is it to maintain a copy of the data in a content-addressed format? If no public publishing is expected and you only need integrity checks, you may choose not to store a full content-addressed replica and instead compute hashes on demand.
- What libraries and which programming languages will you use to interact with the data? For example, Python’s xarray library, via fsspec, can read directly from a local IPFS gateway using [`ipfsspec`](https://github.com/fsspec/ipfsspec).
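On the first question, note that Kubo can compute CIDs without keeping a content-addressed copy. A sketch (the directory name is a placeholder, not from this guide):

```shell
# Compute the root CID of a dataset for integrity checks only:
# --only-hash hashes the content but writes no blocks to the datastore.
ipfs add --recursive --only-hash --quieter my-dataset.zarr
```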
### Single publisher
A single institution runs Kubo nodes to publish and provide data. Users retrieve via gateways or their own nodes.
### Collaborative publishing
Multiple institutions coordinate to provide the same datasets:
- Permissionless: a single writer with multiple follower providers
- Coordination can happen out of band, for example via a shared pinset on GitHub. The original publisher must ensure their data is provided, but once it's added to the pinset, others can replicate it.
### Connecting to existing infrastructure
IPFS can complement existing data infrastructure:
- STAC catalogs can include IPFS CIDs alongside traditional URLs
- Data portals can offer IPFS as an alternative retrieval method
- CI/CD pipelines can automatically add new data to IPFS nodes
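As an illustration of the first bullet, a STAC asset could carry an IPFS URI next to its HTTPS one. The field layout below is a sketch: the `alternate` entry and both hrefs are illustrative assumptions, not prescribed by this guide or the STAC core spec.

```json
{
  "assets": {
    "data": {
      "href": "https://example.org/observations.zarr",
      "alternate": {
        "ipfs": { "href": "ipfs://<dataset-cid>" }
      }
    }
  }
}
```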
## Geospatial format evolution: from NetCDF to Zarr
The scientific community has long relied on formats like NetCDF, HDF5, and GeoTIFF for storing multidimensional array data (also referred to as tensors). While these formats served research well, they were designed for local filesystems and face challenges in the cloud and distributed environments that have become the norm over the last decades, a trend driven both by growing dataset sizes and by cloud and distributed systems enabling the storage and processing of larger volumes of data.
### Limitations of traditional formats
NetCDF and HDF5 interleave metadata with data, requiring large sequential reads to access metadata before reaching the data itself. This creates performance bottlenecks when accessing data over networks, whether that's cloud storage or a peer-to-peer network.
### The rise of Zarr
[Zarr](https://zarr.dev/) has emerged as a cloud-native format optimized for distributed storage:
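The bullets that follow this sentence are elided in this excerpt. At a high level, a Zarr store is just a directory tree of small chunk files plus JSON metadata, which is what makes it amenable to per-chunk addressing and retrieval. An illustrative layout (names are assumptions, following Zarr v3 conventions):

```text
my-dataset.zarr/
    zarr.json            # store-level metadata (Zarr v3)
    temperature/
        zarr.json        # array metadata: shape, chunk shape, dtype
        c/0/0/0          # one chunk stored as one small object/file
```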
Metadata in scientific datasets serves to make the data self-describing.
[**GeoZarr**](https://github.com/zarr-developers/geozarr-spec) is a specification for storing geospatial raster/grid data in the Zarr format. It defines conventions for how to encode coordinate reference systems, spatial dimensions, and other geospatial metadata within Zarr stores. It's conceptually downstream of the ideas in CF CDM (from the [netCDF ecosystem](https://docs.unidata.ucar.edu/netcdf-java/5.2/userguide/common_data_model_overview.html)), but designed for the Zarr ecosystem.
## Ecosystem tooling
### Organizing content-addressed data
#### UnixFS and CAR files
UnixFS is the default format for representing files and directories in IPFS. It chunks large files for incremental verification and parallel retrieval.
To learn more about how to use MFS to organize your data, check out the MFS guide.
[IPFS Cluster](https://ipfscluster.io/) is built on top of Kubo for multi-node deployments: it coordinates pinning across a set of Kubo nodes, ensuring data redundancy and availability, and supports the [Pinning Service API spec](https://ipfs.github.io/pinning-services-api-spec/).
#### Pinning services
Third-party pinning services provide managed infrastructure for persistent storage, useful when you don't want to run your own nodes.
TODO: link to pinning services list in docs
```python
# Sketch: the arguments to open_dataset were elided in this excerpt; the URL
# and engine below are illustrative assumptions (ipfsspec resolves ipfs:// URLs).
import xarray as xr

ds = xr.open_dataset(
    "ipfs://<dataset-cid>",  # placeholder CID
    engine="zarr",
)
```
### Discovery, metadata, and data portals: from discovery all the way to retrieval
TODO: add an intro in the form of a user journey of a scientist looking for data, all the way to retrieving it.
Content discovery is a loaded term that can mean related, albeit distinct, concepts:
- Human-centric
- **Content discovery**: also commonly known as **content routing**, this refers to finding providers (nodes serving the data) for a given CID, including their network addresses. By default, IPFS supports several content routing systems: the Amino DHT, IPNI, and Delegated Routing over HTTP as a common interface for interoperability.
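For example, provider records for a CID can be fetched from a Delegated Routing V1 endpoint with plain HTTP. The host below is a public instance and the CID a placeholder; treat the exact URL as an assumption:

```shell
# Ask a Delegated Routing V1 HTTP endpoint who provides a given CID.
curl "https://delegated-ipfs.dev/routing/v1/providers/<cid>"
```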
### CID discovery
When using content-addressed systems like IPFS, a new challenge emerges: how do users discover the Content Identifiers (CIDs) for datasets they want to access?
STAC has a web browser, making navigation discovery https://github.com/radiantea
-->
## Next steps
- [Publishing Zarr datasets with IPFS](./publish-geospatial-zarr-data.md): a hands-on guide to publishing your first dataset
**File:** docs/how-to/scientific-data/publish-geospatial-zarr-data.md (+26/-26)
---
title: Publish geospatial Zarr data with IPFS
description: Learn how to publish geospatial datasets using IPFS and Zarr for decentralized distribution, data integrity, and open access.
---
# Publish geospatial Zarr data with IPFS
In this guide, you will learn how to publish public geospatial data sets using IPFS, with a focus on the [Zarr](https://zarr.dev/) format. You'll learn how to leverage decentralized distribution with IPFS for better collaboration, data integrity, and open access.
If you are interested in a real-world example following the patterns in this guide, see the ORCESTRA campaign case study.
- [Why IPFS for geospatial data?](#why-ipfs-for-geospatial-data)
- [Prerequisites](#prerequisites)
- [Step 1: Prepare your Zarr data set](#step-1-prepare-your-zarr-data-set)
- [Step 2: Add your data set to IPFS](#step-2-add-your-data-set-to-ipfs)
- [Step 3: Organizing your data](#step-3-organizing-your-data)
- [Step 4: Verify providing status](#step-4-verify-providing-status)
- [Step 5: Content discovery](#step-5-content-discovery)
- [Option A: Share the CID directly](#option-a-share-the-cid-directly)
- [Option B: Use IPNS for updatable references](#option-b-use-ipns-for-updatable-references)
- [Option C: Use DNSLink for human-readable URLs](#option-c-use-dnslink-for-human-readable-urls)
- [Accessing published data](#accessing-published-data)
- [Choosing your approach](#choosing-your-approach)
- [Reference](#reference)
## Why IPFS for geospatial data?
Geospatial data sets such as weather observations, satellite imagery, and sensor readings are typically stored as multidimensional arrays, also commonly known as tensors.
Before starting, ensure you have:
- A Zarr data set ready for publishing
- Basic familiarity with the command line
- [Kubo](../../install/command-line.md) or [IPFS Desktop](../../install/ipfs-desktop.md) installed on a machine.
:::callout
See the [NAT and port forwarding guide](../nat-configuration.md) for more information on how to configure port forwarding so that your IPFS node is publicly reachable, thus allowing reliable retrievability of data by other nodes.
:::
## Step 1: Prepare your Zarr data set
When preparing your Zarr data set for IPFS, aim for approximately 1 MiB chunks to align with IPFS's 1 MiB maximum block size. While this is not a strict requirement, using larger Zarr chunks will cause IPFS to split them into multiple blocks, potentially increasing retrieval latency.
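As a quick sanity check on chunk sizing, you can compute a chunk's byte footprint from its shape and element size. The 64 x 64 x 64 float32 chunk below is an illustrative assumption, not a recommendation from this guide:

```shell
# Bytes per chunk = elements per chunk * bytes per element.
# A 64 x 64 x 64 chunk of float32 (4-byte) values:
echo $((64 * 64 * 64 * 4))  # prints 1048576, i.e. exactly 1 MiB
```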
:::callout
Chunking in Zarr is a nuanced topic beyond the scope of this guide.
:::
## Step 2: Add your data set to IPFS
Add your Zarr folder to IPFS using the `ipfs add` command:
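The command block itself is elided in this excerpt; based on the surrounding text (which mentions `--quieter`), it likely resembles this sketch, with a placeholder directory name:

```shell
# Recursively add the Zarr directory; --quieter prints only the root CID.
ipfs add --recursive --quieter my-dataset.zarr
```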
The `--quieter` flag outputs only the root CID, which identifies the complete dataset.
> **Note:** Check out the [lifecycle of data in IPFS](../../concepts/lifecycle.md) to learn more about how merkleizing, pinning, and providing work under the hood.
## Step 3: Organizing your data
Two options help manage multiple datasets on your node:
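The details of the two options are elided in this excerpt; one of them is MFS (the Mutable File System). A sketch of linking an already-added dataset under an MFS path, with a placeholder CID:

```shell
# Create an MFS directory and link an existing dataset into it by CID.
# MFS references the existing blocks; nothing is copied.
ipfs files mkdir -p /datasets
ipfs files cp /ipfs/<dataset-cid> /datasets/halo
```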
```shell
ipfs files stat --hash /datasets/halo
```
`bafybeihqixf5ew7mfr74bzb74qiw2mgtnytabnpzjnf5xeejzq4p2ocygu` is a new CID representing the combined dataset containing all three HALO flight datasets. The original CIDs are referenced, not copied, so no data is duplicated.
## Step 4: Verify providing status
After adding, Kubo continuously announces your content to the network. Check the status:
```shell
ipfs provide stat
```
For detailed diagnostics, see the [provide system documentation](https://github.com/ipfs/kubo/blob/master/docs/provide-stats.md).
## Step 5: Content discovery
Now that your data is available on the public network, the next step is making it discoverable to others. Choose a sharing approach based on your needs:
If you want to share a stable identifier but be able to update the underlying dataset, create an [IPNS](https://docs.ipfs.tech/concepts/ipns/) identifier and share that instead. This is useful for datasets that get updated regularly — users can bookmark your IPNS name and always retrieve the latest version.
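A sketch of that flow with Kubo (the key name and CID are placeholders):

```shell
# Generate a dedicated key for the dataset, then publish the CID under it.
ipfs key gen halo-dataset
ipfs name publish --key=halo-dataset /ipfs/<dataset-cid>
```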
```shell
ipfs name publish /ipfs/<new-dataset-cid>
```
IPNS is supported by all the retrieval methods in the [Accessing published data](#accessing-published-data) section below. Keep in mind that IPNS name resolution adds latency to the retrieval process.
### Option C: Use DNSLink for human-readable URLs
Link a DNS name to your CID by adding a TXT record:
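The record itself is elided here; by the DNSLink convention, it is a TXT record on the `_dnslink` subdomain. The domain matches the examples that follow, and the CID is a placeholder:

```
_dnslink.data.example.org. IN TXT "dnslink=/ipfs/<dataset-cid>"
```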
Users can then access your data using one of the following methods:
- With Kubo: `ipfs cat /ipns/data.example.org/zarr.json`
- Using ipfsspec in Python as detailed below in [Python with ipfsspec](#python-with-ipfsspec), which also supports IPNS names, so you can use `ipns://data.example.org/zarr.json` directly.
## Accessing published data
Once published, users can access your Zarr datasets through multiple methods:
### IPFS HTTP gateways
See the [retrieval guide](../../quickstart/retrieve.md).
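As a quick illustration (the gateway host and CID are placeholders; see the linked guide for details), fetching the store's root metadata through a gateway looks like:

```shell
# Fetch the Zarr store's top-level metadata file via a public HTTP gateway.
curl "https://ipfs.io/ipfs/<dataset-cid>/zarr.json"
```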
```javascript
import { verifiedFetch } from '@helia/verified-fetch'
```
**File:** docs/quickstart/retrieve.md (+1/-1)
To fetch the CID using an IPFS gateway is as simple as loading one of the following URLs:
In this quickstart guide, you learned the different approaches to retrieving CIDs from the IPFS network and how to pick the most appropriate method for your specific needs.
You then fetched the image that was pinned in the [publishing with a pinning service quickstart guide](./pin.md) using an IPFS Kubo node and an IPFS Gateway.