
Commit 4fdd401

gadomski, kylebarron, and Tom Augspurger authored
fix: add schema.md (#13)
I missed this file during my initial git export, but I think it belongs here, not in https://github.com/stac-utils/stac-geoparquet, as it doesn't have any code examples from that repo.

Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: Tom Augspurger <[email protected]>
1 parent 83ccdc9 commit 4fdd401

File tree

2 files changed: +43, -0 lines


docs/schema.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# Schema considerations

A STAC Item is a JSON object that describes an external geospatial dataset. The STAC specification defines a common core plus a variety of extensions, and STAC Items may also include custom extensions outside the common ones. Crucially, the majority of the fields specified in the core spec and extensions are optional. Those keys often differ across STAC collections and may even differ across items within a single collection.

STAC's flexibility is a blessing and a curse. The flexibility of schemaless JSON makes writing very easy, as each object can be dumped to JSON separately. Every item is allowed to have a different schema, and newer items are free to have a different schema than older items in the same collection. But this write-time flexibility makes reading harder, as there are no guarantees (beyond STAC's few required fields) about which fields exist.

Parquet is the complete opposite of JSON. Parquet has a strict schema that must be known before writing can start. This puts the burden of work onto the writer instead of the reader. Reading Parquet is very efficient because the file's metadata defines the exact schema of every record. A strict schema also enables use cases, such as reading specific columns, that would not be possible otherwise.

This conversion from schemaless to strict-schema is the difficult part of converting STAC from JSON to GeoParquet, especially because input datasets are often larger than memory.
## Full scan over input data

The most foolproof way to convert STAC JSON to GeoParquet is to perform a full scan over the input data. This is done automatically by [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow] when a schema is not provided.

This is time consuming because it requires two full passes over the input data: once to infer a common schema and again to actually write to Parquet (though items are never fully held in memory, allowing this process to scale).
## User-provided schema

Alternatively, the user can pass in an Arrow schema using the `schema` parameter of [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow]. This `schema` must match the on-disk schema of the STAC JSON data.
## Multiple schemas per collection

It is also possible to write multiple Parquet files with STAC data where each Parquet file may have a different schema. This simplifies the conversion and writing process but makes reading and using the Parquet data harder.
### Merging data with schema mismatch

If you've created STAC GeoParquet data where the schema has changed over time, you can use [`pyarrow.concat_tables`][pyarrow.concat_tables] with `promote_options="permissive"` to combine multiple STAC GeoParquet files.
```py
import pyarrow as pa
import pyarrow.parquet as pq

table_1 = pq.read_table("stac1.parquet")
table_2 = pq.read_table("stac2.parquet")
combined_table = pa.concat_tables([table_1, table_2], promote_options="permissive")
```
## Future work

Schema operations are an area where future work can improve the reliability and ease of use of STAC GeoParquet.

It's possible that in the future we could automatically infer an Arrow schema from the STAC specification's published JSON Schema files. If you're interested in this, open an issue to discuss it.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ extra_css:
 
 nav:
   - index.md
+  - schema.md
   - drawbacks.md
 
 plugins:
