Commit c159476

Add the initial draft of the Parquet Raster standard (#259)
* Add the initial draft of Parquet Raster to start the discussion
* Move to format-specs per Chris' feedback
* First round of fix
* Add work in progress
* Update format-specs/parquet-raster.md

Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>

---------

Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
1 parent 4070f41 commit c159476

1 file changed

Lines changed: 133 additions & 0 deletions

File tree

format-specs/parquet-raster.md

@@ -0,0 +1,133 @@
1+
# [Work in Progress] Parquet Raster Specification
2+
3+
## Overview
4+
5+
[Apache Parquet](https://parquet.apache.org/) provides a standardized open-source columnar storage format and natively supports geo types (i.e., Geometry and Geography types). The Parquet Raster specification defines how geo-referenced raster imagery data ("raster" for short) should be stored in Parquet format, including the representation of a raster and the additional metadata required.
6+
7+
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).
8+
9+
## Raster columns
10+
11+
A raster column MUST be stored as a `struct` type column in Parquet files. The `struct` type MUST contain the fields defined in the table in the [Raster value](#raster-value) section. The raster column MUST be stored at the root level of the Parquet file.
12+
13+
Each raster column MUST also have a corresponding `Geometry` or `Geography` type column, stored at the top level of the Parquet file. The name of the geometry column MUST be specified in the `geometry` field of the raster column metadata.
14+
15+
## Raster Representation
16+
17+
The raster data model is largely inspired by the WKB raster encoding of PostGIS, but moves the raster metadata out of the binary encoding. Raster data always uses little-endian byte order.
18+
19+
### Raster value
20+
21+
A raster value is composed of the following components:
22+
23+
| Field | Parquet Physical Type | Parquet Logical Type | Description |
24+
|--------------|-----------------------|----------------------|-------------------------------------------------------------------------|
25+
| `crs` | `BYTE_ARRAY` | UTF8 | **OPTIONAL.** The coordinate reference system of the raster |
26+
| `scale_x` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in X direction |
27+
| `scale_y` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in Y direction |
28+
| `ip_x` | `DOUBLE` | | **REQUIRED.** The X coordinate of the upper left corner of the raster |
29+
| `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster |
30+
| `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction |
31+
| `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction |
32+
| `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels |
33+
| `height` | `INT32` | | **REQUIRED.** The height of the raster in pixels |
34+
| `bands` | `BYTE_ARRAY` | List<BYTE_ARRAY> | **REQUIRED.** The bands of the raster. See the band data encoding below |
35+
36+
A raster is one or more grids of cells. All the grids should have `height` rows and `width` columns. The grid cells are represented by the `bands` field. The grids are geo-referenced using an affine transformation that maps the grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field; see the [CRS Customization](#crs-customization) section for details.
37+
38+
The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification only supports affine transformations for geo-referencing; other transformations, such as polynomial transformations, are not supported.
39+
40+
The affine transformation is defined as follows:
41+
42+
```
43+
world_x = ip_x + (col + 0.5) * scale_x + (row + 0.5) * skew_x
44+
world_y = ip_y + (col + 0.5) * skew_y + (row + 0.5) * scale_y
45+
```
46+
47+
col = the column number (pixel index) from the left (0 is the first/leftmost column)
48+
row = the row number (pixel index) from the top (0 is the first/topmost row)
49+
50+
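The grid coordinates of a raster are always anchored at the center of grid cells: integer `col` and `row` values map to the world coordinate of the center of the corresponding cell, which is why the formula adds 0.5 to each index. The translation factors `ip_x` and `ip_y` designate the world coordinate of the upper left corner of the upper left grid cell, as stated in the raster value table.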
51+
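As an illustration, the affine transformation above could be applied as in the following minimal Python sketch. The `GeoTransform` class, its method name, and the sample parameter values are hypothetical and not part of this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoTransform:
    """Affine geo-referencing parameters of a raster value (illustrative)."""
    ip_x: float
    ip_y: float
    scale_x: float
    scale_y: float
    skew_x: float
    skew_y: float

    def cell_to_world(self, col: int, row: int) -> tuple[float, float]:
        """Map a 0-based (col, row) grid index to world coordinates.

        Follows the formula of the specification term by term.
        """
        world_x = self.ip_x + (col + 0.5) * self.scale_x + (row + 0.5) * self.skew_x
        world_y = self.ip_y + (col + 0.5) * self.skew_y + (row + 0.5) * self.scale_y
        return world_x, world_y


# Hypothetical north-up raster: 10-unit cells, no skew.
gt = GeoTransform(ip_x=500000.0, ip_y=4100000.0,
                  scale_x=10.0, scale_y=-10.0,
                  skew_x=0.0, skew_y=0.0)
print(gt.cell_to_world(0, 0))  # (500005.0, 4099995.0)
```

A negative `scale_y` is the usual choice for north-up imagery, where row indices grow downward while world Y grows upward.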
52+
This specification supports persisting raster band values in two different ways specified by the `isOffline` flag in the band data encoding. The two options are:
53+
54+
* **in-db**: The band values are stored in the same Parquet file as the geo-referencing information.
55+
* **out-db**: The band values are stored in files external to the Parquet file.
56+
57+
### Band data encoding
58+
59+
| Name | Type | Meaning |
60+
|-------------------|-------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
61+
| `isOffline` | 1 bit | If true, data is found on external storage, through the URI given in the `url` field (see [Out-DB pixel data encoding](#out-db-pixel-data-encoding)). |
62+
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
63+
| `isAllNodata` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
64+
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
65+
| `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float |
66+
| `nodata` | 1 to 8 bytes (depending on `pixtype` [1]) | Nodata value. |
67+
| `length` | int64 | Length of the `data` byte_array in bytes. |
68+
| `data` | byte_array | Raster band pixel data (see below). |
69+
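The four 1-bit flags and the 4-bit `pixtype` add up to one byte. The draft does not state how they are packed, so the sketch below assumes a PostGIS-like layout: the flags occupy the four high bits in table order and `pixtype` the four low bits. The names `BandFlags` and `parse_band_flags` are hypothetical.

```python
from typing import NamedTuple


class BandFlags(NamedTuple):
    is_offline: bool
    has_nodata_value: bool
    is_all_nodata: bool
    is_gzipped: bool
    pixtype: int


def parse_band_flags(first_byte: int) -> BandFlags:
    """Split the leading byte of a band into its flags and pixel type.

    ASSUMPTION: flags use bits 7..4 in table order and `pixtype` uses
    bits 3..0; the draft does not fix the bit order.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    pixtype = first_byte & 0x0F
    # The pixtype table defines 0-8, 10 and 11 only.
    if pixtype == 9 or pixtype > 11:
        raise ValueError(f"undefined pixtype {pixtype}")
    return BandFlags(
        is_offline=bool(first_byte & 0x80),
        has_nodata_value=bool(first_byte & 0x40),
        is_all_nodata=bool(first_byte & 0x20),
        is_gzipped=bool(first_byte & 0x10),
        pixtype=pixtype,
    )


# 0x44 = 0100 0100: nodata value present, 8-bit unsigned pixels.
print(parse_band_flags(0x44))
```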
70+
### In-DB pixel data encoding
71+
72+
This encoding is used when `isOffline` flag is false.
73+
74+
| Name | Type | Meaning |
75+
|--------------|-----------------|---------|
76+
| `pix[w*h]` | 1 to 8 bytes (depending on `pixtype` [1]) | Pixel values, row after row. `pix[0]` is the upper-left, `pix[w-1]` is the upper-right. <br><br>Multi-byte values use little-endian byte order. For values of up to 8 bits, byte order does not apply (bit-order is most significant first). |
77+
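A reader might decode the in-DB pixel array along the following lines. This is a sketch using only the Python standard library; it assumes the `data` bytes have already been decompressed if `isGZIPPed` was set, and the names `PIXTYPE_FORMATS` and `decode_pixels` are hypothetical.

```python
import struct

# struct format character per `pixtype`. The 1-, 2- and 4-bit types
# still occupy one byte per value (footnote [1]).
PIXTYPE_FORMATS = {
    0: "B", 1: "B", 2: "B",   # 1/2/4-bit values, one byte each
    3: "b", 4: "B",           # 8-bit signed / unsigned
    5: "h", 6: "H",           # 16-bit signed / unsigned
    7: "i", 8: "I",           # 32-bit signed / unsigned
    10: "f", 11: "d",         # 32-bit / 64-bit float
}


def decode_pixels(data: bytes, pixtype: int, width: int, height: int) -> list[list]:
    """Decode row-major, little-endian pixel bytes into a list of rows."""
    fmt = PIXTYPE_FORMATS[pixtype]
    expected = width * height * struct.calcsize("<" + fmt)
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    flat = struct.unpack(f"<{width * height}{fmt}", data)
    # pix[0] is the upper-left cell, pix[width - 1] the upper-right.
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


# A 3x2 band of 16-bit unsigned integers (pixtype 6).
raw = struct.pack("<6H", 1, 2, 3, 4, 5, 6)
print(decode_pixels(raw, pixtype=6, width=3, height=2))  # [[1, 2, 3], [4, 5, 6]]
```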
78+
### Out-DB pixel data encoding
79+
80+
This encoding is used when `isOffline` flag is true.
81+
82+
| Name | Type | Meaning |
83+
|--------------|--------|-------------------------------------------------------------------------|
84+
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
85+
| `length` | int16 | Length of the `url` string in bytes. |
86+
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files). |
87+
88+
The allowed URI schemes are:
89+
* `file://`: Local file system
90+
* `http://`: HTTP
91+
* `https://`: HTTPS
92+
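The out-DB layout and the scheme restriction could be handled as sketched below. The sketch assumes little-endian signed integers, no padding between fields, and a UTF-8 encoded `url`, none of which the draft states explicitly; `parse_out_db` and the sample URL are hypothetical.

```python
import struct
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"file", "http", "https"}
HEADER = struct.Struct("<bh")  # bandNumber (int8), length (int16)


def parse_out_db(data: bytes) -> tuple[int, str]:
    """Parse an out-DB pixel data payload into (band_number, url)."""
    band_number, length = HEADER.unpack_from(data, 0)
    if length < 0:
        raise ValueError("negative url length")
    url_bytes = data[HEADER.size:HEADER.size + length]
    if len(url_bytes) != length:
        raise ValueError("payload shorter than declared url length")
    url = url_bytes.decode("utf-8")  # ASSUMPTION: UTF-8 encoded string
    if urlsplit(url).scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URI scheme in {url!r}")
    return band_number, url


# Hypothetical payload pointing at band 2 of an external GeoTIFF.
sample_url = "https://example.com/scene.tif".encode("utf-8")
payload = HEADER.pack(2, len(sample_url)) + sample_url
print(parse_out_db(payload))  # (2, 'https://example.com/scene.tif')
```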
93+
---
94+
95+
[1] Note: 1, 2, and 4 bit `pixtype`s are still encoded as 1 byte per value.
96+
97+
### CRS Customization
98+
99+
CRS is represented as a string value. Writer and reader implementations are
100+
responsible for serializing and deserializing the CRS, respectively.
101+
102+
As a convention to maximize interoperability, custom CRS values can be
103+
specified by a string of the format `type:value`, where `type` is one of
104+
the following values:
105+
106+
* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `value` is the SRID itself.
107+
* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `value` is the PROJJSON string.
108+
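A reader following the `type:value` convention might split the string at the first colon, as in this sketch. The function name `parse_crs` is hypothetical, the SRID is kept as a string because the draft does not say how it should be typed, and the PROJJSON in the example is abbreviated for illustration only.

```python
import json


def parse_crs(crs: str):
    """Split a `type:value` CRS string into (type, parsed value)."""
    # Split at the first colon only: PROJJSON values contain colons.
    kind, sep, value = crs.partition(":")
    if not sep:
        raise ValueError("CRS string is not of the form type:value")
    if kind == "srid":
        return kind, value
    if kind == "projjson":
        return kind, json.loads(value)
    raise ValueError(f"unknown CRS type {kind!r}")


print(parse_crs("srid:4326"))
print(parse_crs('projjson:{"type": "GeographicCRS", "name": "WGS 84"}'))
```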
109+
110+
## Metadata
111+
112+
Parquet Raster files include additional metadata at two levels:
113+
114+
1. File metadata indicating things like the version of this specification used
115+
2. Column metadata with additional metadata for each raster column
116+
117+
### File metadata
118+
119+
| Field Name | Type | Description |
120+
| ------------------ | ------ |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
121+
| version | string | **REQUIRED.** The version identifier for the Parquet Raster specification. |
122+
| primary_column | string | **REQUIRED.** The name of the "primary" raster column. In cases where a Parquet file contains multiple raster columns, the primary raster may be used by default in raster operations. |
123+
| columns | object\<string, [Column Metadata](#column-metadata)> | **REQUIRED.** Metadata about raster columns. Each key is the name of a raster column in the table. |
124+
125+
At this level, additional implementation-specific fields (e.g. library name) MAY be present, and readers should be robust in ignoring those.
126+
127+
### Column metadata
128+
129+
Each raster column in the dataset, although annotated with Parquet `struct` type, MUST be included in the `columns` field above with the following content, keyed by the column name:
130+
131+
| Field Name | Type | Description |
132+
|------------| ------------ |------------------------------------------------------------------------------------------|
133+
| geometry | string | **REQUIRED.** Name of the geo-reference column to help accelerate spatial data retrieval |
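
To make the two metadata levels concrete, the sketch below checks a metadata object against the REQUIRED fields listed above. The column names `raster` and `footprint`, the version string, and the function `validate_file_metadata` are placeholders chosen for illustration; the draft does not define a version value or how the object is serialized.

```python
def validate_file_metadata(meta: dict, column_names: set[str]) -> None:
    """Raise ValueError if REQUIRED metadata fields are missing or dangling."""
    for key in ("version", "primary_column", "columns"):
        if key not in meta:
            raise ValueError(f"missing required field {key!r}")
    columns = meta["columns"]
    # The primary column is itself a raster column, so it needs an entry.
    if meta["primary_column"] not in columns:
        raise ValueError("primary_column has no entry in columns")
    for name, col_meta in columns.items():
        if name not in column_names:
            raise ValueError(f"raster column {name!r} not found in file")
        geometry = col_meta.get("geometry")
        if geometry is None:
            raise ValueError(f"column {name!r} lacks required 'geometry'")
        if geometry not in column_names:
            raise ValueError(f"geometry column {geometry!r} not found in file")


# Hypothetical metadata for a file with one raster column.
example = {
    "version": "0.1.0",  # placeholder, not defined by the draft
    "primary_column": "raster",
    "columns": {"raster": {"geometry": "footprint"}},
}
validate_file_metadata(example, {"raster", "footprint"})
```

Unknown extra keys are deliberately ignored, in line with the note that implementation-specific fields MAY be present.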
