|
| 1 | +# [Work in Progress] Parquet Raster Specification |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +[Apache Parquet](https://parquet.apache.org/) is a standardized open-source columnar storage format that natively supports geospatial types (i.e., the Geometry and Geography types). The Parquet Raster specification defines how geo-referenced raster imagery data (hereafter, raster) should be stored in Parquet format, including the representation of rasters and the required additional metadata. |
| 6 | + |
| 7 | +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). |
| 8 | + |
| 9 | +## Raster columns |
| 10 | + |
| 11 | +A raster column MUST be stored as a `struct` type column in Parquet files. The `struct` type MUST contain the fields defined in the [Raster value](#raster-value) section. The raster column MUST be stored at the root level of the Parquet file. |
| 12 | + |
| 13 | +Each raster column MUST also have a corresponding `Geometry` or `Geography` type column, stored at the top level of the Parquet file. The name of the geometry column MUST be specified in the `geometry` field of the raster column metadata. |
| 14 | + |
| 15 | +## Raster Representation |
| 16 | + |
| 17 | +The raster data model is largely inspired by the WKB raster encoding of PostGIS but extracts the raster metadata out of the binary encoding. It always uses the little-endian byte order for the raster data. |
| 18 | + |
| 19 | +### Raster value |
| 20 | + |
| 21 | +A raster value is composed of the following components: |
| 22 | + |
| 23 | +| Field | Parquet Physical Type | Parquet Logical Type | Description | |
| 24 | +|--------------|-----------------------|----------------------|-------------------------------------------------------------------------| |
| 25 | +| `crs` | `BYTE_ARRAY` | UTF8 | **OPTIONAL.** The coordinate reference system of the raster | |
| 26 | +| `scale_x` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in X direction | |
| 27 | +| `scale_y` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in Y direction | |
| 28 | +| `ip_x` | `DOUBLE` | | **REQUIRED.** The X coordinate of the upper left corner of the raster | |
| 29 | +| `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster | |
| 30 | +| `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction | |
| 31 | +| `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction | |
| 32 | +| `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels | |
| 33 | +| `height` | `INT32` | | **REQUIRED.** The height of the raster in pixels | |
| 34 | +| `bands` | `BYTE_ARRAY` | List<BYTE_ARRAY> | **REQUIRED.** The bands of the raster. See the band data encoding below | |
| 35 | + |
| 36 | +A raster is one or more grids of cells. All grids should have `height` rows and `width` columns. The grid cells are stored in the `bands` field. The grids are geo-referenced using an affine transformation that maps grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field. For more details, please refer to the [CRS Customization](#crs-customization) section. |
| 37 | + |
| 38 | +The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification supports only affine transformations for geo-referencing; other transformations, such as polynomial transformations, are not supported. |
| 39 | + |
| 40 | +The affine transformation is defined as follows: |
| 41 | + |
| 42 | +``` |
| 43 | +world_x = ip_x + (col + 0.5) * scale_x + (row + 0.5) * skew_x |
| 44 | +world_y = ip_y + (col + 0.5) * skew_y + (row + 0.5) * scale_y |
| 45 | +``` |
| 46 | + |
| 47 | +* `col`: the column number (pixel index) from the left (0 is the first/leftmost column) |
| 48 | +* `row`: the row number (pixel index) from the top (0 is the first/topmost row) |
| 49 | + |
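| 50 | +The grid coordinates of a raster are always anchored at the center of grid cells: integer `(col, row)` indices map to cell centers, which is why the transformation adds 0.5 to each index. The translation terms `ip_x` and `ip_y` designate the world coordinates of the upper-left corner of the raster, so the center of the upper-left grid cell is obtained with `col = 0` and `row = 0`. |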
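As an illustration only, the affine transformation above can be sketched in Python. The `GeoTransform` class and its `pixel_to_world` method are hypothetical names introduced for this example; the field names mirror the raster value fields, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GeoTransform:
    """Affine geo-referencing parameters of a raster value (hypothetical helper)."""
    ip_x: float
    ip_y: float
    scale_x: float
    scale_y: float
    skew_x: float
    skew_y: float

    def pixel_to_world(self, col: int, row: int) -> Tuple[float, float]:
        # The 0.5 offsets anchor the result at the center of the grid cell.
        cx, cy = col + 0.5, row + 0.5
        world_x = self.ip_x + cx * self.scale_x + cy * self.skew_x
        world_y = self.ip_y + cx * self.skew_y + cy * self.scale_y
        return world_x, world_y


# Illustrative values: a north-up raster with 10-unit cells and no skew.
gt = GeoTransform(ip_x=100.0, ip_y=200.0, scale_x=10.0, scale_y=-10.0,
                  skew_x=0.0, skew_y=0.0)
upper_left_center = gt.pixel_to_world(0, 0)
```

With these sample values, `scale_y` is negative because row indices grow downward while the world Y axis grows upward.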
| 51 | + |
| 52 | +This specification supports persisting raster band values in two ways, selected by the `isOffline` flag in the band data encoding: |
| 53 | + |
| 54 | +* **in-db**: The band values are stored in the same Parquet file as the geo-referencing information. |
| 55 | +* **out-db**: The band values are stored in files external to the Parquet file. |
| 56 | + |
| 57 | +### Band data encoding |
| 58 | + |
| 59 | +| Name | Type | Meaning | |
| 60 | +|-------------------|-------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 61 | +| `isOffline` | 1 bit | If true, the data is found on external storage, at the URI given in the `url` field of the out-db pixel data encoding. | |
| 62 | +| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. | |
| 63 | +| `isAllNodata` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. | |
| 64 | +| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. | |
| 65 | +| `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float | |
| 66 | +| `nodata` | 1 to 8 bytes (depending on `pixtype` [1]) | Nodata value. | |
| 67 | +| `length` | int64 | Length of the `data` byte_array in bytes. | |
| 68 | +| `data` | byte_array | Raster band pixel data (see below). | |
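The table above does not state how the four flags and `pixtype` are packed. The following Python sketch assumes, purely for illustration, that they share a single byte with the flags in the high bits (in table order, `isOffline` as the most significant bit) and `pixtype` in the low four bits. That layout, along with the names `BandFlags`, `parse_band_flags`, and `PIXTYPE_BYTES`, is an assumption and not part of this specification. The byte sizes follow the `pixtype` list and note [1].

```python
from typing import NamedTuple

# Storage size in bytes per pixel value for each pixtype code.
# Sub-byte types (codes 0, 1, 2) still occupy one byte per value, see note [1].
PIXTYPE_BYTES = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 4, 8: 4, 10: 4, 11: 8}


class BandFlags(NamedTuple):
    is_offline: bool
    has_nodata_value: bool
    is_all_nodata: bool
    is_gzipped: bool
    pixtype: int


def parse_band_flags(flag_byte: int) -> BandFlags:
    """Decode one flag byte under the ASSUMED bit layout described above."""
    pixtype = flag_byte & 0x0F
    if pixtype not in PIXTYPE_BYTES:
        raise ValueError(f"unknown pixtype code: {pixtype}")
    return BandFlags(
        is_offline=bool(flag_byte & 0x80),
        has_nodata_value=bool(flag_byte & 0x40),
        is_all_nodata=bool(flag_byte & 0x20),
        is_gzipped=bool(flag_byte & 0x10),
        pixtype=pixtype,
    )


# 0x45 = 0b0100_0101: only hasNodataValue set, pixtype 5 (16-bit signed integer).
flags = parse_band_flags(0x45)
nodata_size = PIXTYPE_BYTES[flags.pixtype]
```

Under this assumption, the size of the `nodata` field that follows the flags can be looked up directly from `PIXTYPE_BYTES`.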
| 69 | + |
| 70 | +### In-DB pixel data encoding |
| 71 | + |
| 72 | +This encoding is used when the `isOffline` flag is false. |
| 73 | + |
| 74 | +| Name | Type | Meaning | |
| 75 | +|--------------|-----------------|---------| |
| 76 | +| `pix[w*h]` | 1 to 8 bytes (depending on `pixtype` [1]) | Pixel values, row after row. `pix[0]` is the upper-left, `pix[w-1]` is the upper-right. <br><br>Multi-byte values use little-endian byte order (see [Raster Representation](#raster-representation)). Byte order is implicit for types of up to 8 bits (bit-order is most significant first). | |
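A minimal sketch of decoding in-db pixel data with Python's standard `struct` module follows. It assumes the input is exactly the `pix[w*h]` payload, already decompressed if `isGZIPPed` was set, and in little-endian byte order. The names `decode_pixels` and `PIXTYPE_STRUCT` are hypothetical.

```python
import struct
from typing import List

# struct format character for each pixtype code; sub-byte types use one byte.
PIXTYPE_STRUCT = {
    0: "B", 1: "B", 2: "B",   # 1-, 2-, 4-bit values, stored as one byte each
    3: "b", 4: "B",           # 8-bit signed / unsigned
    5: "h", 6: "H",           # 16-bit signed / unsigned
    7: "i", 8: "I",           # 32-bit signed / unsigned
    10: "f", 11: "d",         # 32-bit / 64-bit float
}


def decode_pixels(data: bytes, pixtype: int, width: int, height: int) -> List[list]:
    """Return pixel values as a list of rows, with the top row first."""
    code = PIXTYPE_STRUCT[pixtype]
    count = width * height
    expected = count * struct.calcsize("<" + code)
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    flat = struct.unpack(f"<{count}{code}", data)
    # Values are stored row after row, so each row is a slice of `width` values.
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


# Illustrative 3x2 band of 16-bit unsigned integers (pixtype 6).
sample = struct.pack("<6H", 1, 2, 3, 4, 5, 6)
rows = decode_pixels(sample, pixtype=6, width=3, height=2)
```

Here `rows[0][0]` corresponds to `pix[0]` (upper-left) and `rows[0][width - 1]` to `pix[w-1]` (upper-right).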
| 77 | + |
| 78 | +### Out-DB pixel data encoding |
| 79 | + |
| 80 | +This encoding is used when the `isOffline` flag is true. |
| 81 | + |
| 82 | +| Name | Type | Meaning | |
| 83 | +|--------------|--------|-------------------------------------------------------------------------| |
| 84 | +| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. | |
| 85 | +| `length` | int16 | Length of the `url` string in bytes. | |
| 86 | +| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files). | |
| 87 | + |
| 88 | +The allowed URI schemes are: |
| 89 | +* `file://`: Local file system |
| 90 | +* `http://`: HTTP |
| 91 | +* `https://`: HTTPS |
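The out-db payload and the scheme restriction above can be sketched as follows. The sketch assumes that `bandNumber` and `length` are signed, little-endian integers packed without padding, and that `url` is UTF-8 encoded. None of these details is stated in the table, and the function names are hypothetical.

```python
import struct
from typing import Tuple
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"file", "http", "https"}
_HEADER = struct.Struct("<bh")  # int8 bandNumber, int16 length (assumed signed)


def encode_outdb(band_number: int, url: str) -> bytes:
    """Serialize an out-db reference: bandNumber, length, then the URL bytes."""
    raw = url.encode("utf-8")
    return _HEADER.pack(band_number, len(raw)) + raw


def parse_outdb(data: bytes) -> Tuple[int, str]:
    """Parse an out-db reference and reject URIs with a disallowed scheme."""
    band_number, url_len = _HEADER.unpack_from(data, 0)
    raw = data[_HEADER.size:_HEADER.size + url_len]
    if url_len < 0 or len(raw) != url_len:
        raise ValueError("truncated or malformed out-db payload")
    url = raw.decode("utf-8")
    scheme = urlparse(url).scheme
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"URI scheme not allowed: {scheme!r}")
    return band_number, url


# Round trip with an illustrative URL (example.com is a placeholder host).
payload = encode_outdb(1, "https://example.com/scene.tif")
band_number, url = parse_outdb(payload)
```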
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +[1] Note: 1, 2, and 4 bit `pixtype`s are still encoded as 1 byte per value. |
| 96 | + |
| 97 | +### CRS Customization |
| 98 | + |
| 99 | +CRS is represented as a string value. Writer and reader implementations are |
| 100 | +responsible for serializing and deserializing the CRS, respectively. |
| 101 | + |
| 102 | +As a convention to maximize interoperability, custom CRS values can be |
| 103 | +specified as a string of the format `type:value`, where `type` is one of |
| 104 | +the following values: |
| 105 | + |
| 106 | +* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `value` is the SRID itself. |
| 107 | +* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `value` is the PROJJSON string. |
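A reader might handle the `type:value` convention roughly as in the sketch below. The `parse_crs` function and the `"other"` fallback for strings that do not follow the convention are illustrative choices, not requirements of this specification.

```python
import json
from typing import Any, Tuple


def parse_crs(crs: str) -> Tuple[str, Any]:
    """Split a CRS string of the form `type:value` into (type, parsed value)."""
    kind, sep, value = crs.partition(":")  # split at the first colon only
    if sep and kind == "srid":
        return "srid", int(value)
    if sep and kind == "projjson":
        return "projjson", json.loads(value)
    # Anything else is passed through untouched for the caller to interpret.
    return "other", crs


srid_crs = parse_crs("srid:4326")
projjson_crs = parse_crs('projjson:{"type": "GeographicCRS", "name": "WGS 84"}')
```

Splitting at the first colon only matters for `projjson`, because the PROJJSON document itself contains colons.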
| 108 | + |
| 109 | + |
| 110 | +## Metadata |
| 111 | + |
| 112 | +Parquet Raster files include additional metadata at two levels: |
| 113 | + |
| 114 | +1. File metadata indicating things like the version of this specification used |
| 115 | +2. Column metadata with additional metadata for each raster column |
| 116 | + |
| 117 | +### File metadata |
| 118 | + |
| 119 | +| Field Name | Type | Description | |
| 120 | +| ------------------ | ------ |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 121 | +| version | string | **REQUIRED.** The version identifier for the Parquet Raster specification. | |
| 122 | +| primary_column | string | **REQUIRED.** The name of the "primary" raster column. In cases where a Parquet file contains multiple raster columns, the primary raster may be used by default in raster operations. | |
| 123 | +| columns | object\<string, [Column Metadata](#column-metadata)> | **REQUIRED.** Metadata about raster columns. Each key is the name of a raster column in the table. | |
| 124 | + |
| 125 | +At this level, additional implementation-specific fields (e.g. library name) MAY be present, and readers should be robust in ignoring those. |
| 126 | + |
| 127 | +### Column metadata |
| 128 | + |
| 129 | +Each raster column in the dataset, although annotated with Parquet `struct` type, MUST be included in the `columns` field above with the following content, keyed by the column name: |
| 130 | + |
| 131 | +| Field Name | Type | Description | |
| 132 | +|------------| ------------ |------------------------------------------------------------------------------------------| |
| 133 | +| geometry | string | **REQUIRED.** Name of the corresponding `Geometry` or `Geography` column, used to help accelerate spatial data retrieval | |
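To make the two metadata levels concrete, the sketch below builds a metadata object and checks the required fields. The version string `"0.1.0"`, the column names `rast` and `geom`, and the use of JSON for serialization are all assumptions made for illustration. The check that `primary_column` appears in `columns` is inferred from the field descriptions rather than stated as a rule.

```python
import json


def validate_file_metadata(meta: dict) -> None:
    """Raise ValueError if a REQUIRED file or column metadata field is missing."""
    for field in ("version", "primary_column", "columns"):
        if field not in meta:
            raise ValueError(f"missing required field: {field}")
    if meta["primary_column"] not in meta["columns"]:
        raise ValueError("primary_column is not listed in columns")
    for name, column in meta["columns"].items():
        if not isinstance(column.get("geometry"), str):
            raise ValueError(f"column {name!r} lacks a 'geometry' string")


# Hypothetical metadata for a file with one raster column and its geometry column.
metadata = {
    "version": "0.1.0",       # placeholder, not an official version identifier
    "primary_column": "rast",
    "columns": {"rast": {"geometry": "geom"}},
}
validate_file_metadata(metadata)
serialized = json.dumps(metadata)
```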