Merged
122 changes: 122 additions & 0 deletions raster/parquet-raster.md
@@ -0,0 +1,122 @@
# Parquet Raster Specification

## Overview

[Apache Parquet](https://parquet.apache.org/) is a standardized, open-source columnar storage format that natively supports geospatial types (the Geometry and Geography types). The Parquet Raster specification defines how geo-referenced raster imagery (hereafter, "raster") should be stored in Parquet format, including the representation of rasters and the additional metadata required.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt).

## Raster columns

A raster column MUST be stored as a `struct` type column in Parquet files. The `struct` type MUST contain the fields defined in the [Raster value](#raster-value) table. The raster column MUST be stored at the root level of the Parquet file.

Each raster column MUST also have a corresponding `Geometry` or `Geography` type column, stored at the top level of the Parquet file. The name of that column MUST be specified in the `geometry` field of the raster column metadata.

## Raster Representation

The raster data model is largely inspired by the WKB raster encoding of PostGIS but extracts the raster metadata out of the binary encoding.

### Raster value

A raster value is composed of the following components:

| Field | Parquet Physical Type | Parquet Logical Type | Description |
|--------------|-----------------------|----------------------|-------------------------------------------------------------------------|
| `endianness` | `boolean` | | **REQUIRED.** True: little endian; False: big endian |
| `crs` | `BYTE_ARRAY` | UTF8 | **OPTIONAL.** The coordinate reference system of the raster |
| `scale_x` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in X direction |
| `scale_y` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in Y direction |
| `ip_x` | `DOUBLE` | | **REQUIRED.** The X coordinate of the upper left corner of the raster |
| `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster |
| `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction |
| `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction |
| `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels |
Contributor

This proposal is restricted to 2D rasters. Is this a conscious choice to not allow n-D rasters such as those permitted by Zarr or netCDF?

Collaborator Author

+1. We want to allow n-D rasters! We should definitely figure out how to incorporate them in this spec.

| `height` | `INT32` | | **REQUIRED.** The height of the raster in pixels |
| `bands` | `BYTE_ARRAY` | List<BYTE_ARRAY> | **REQUIRED.** The bands of the raster. See the band data encoding below |

A raster is one or more grids of cells. All the grids should have `width` columns and `height` rows. The grid cells are stored in the `bands` field. The grids are geo-referenced using an affine transformation that maps grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field. For more details, please refer to the [CRS Customization](#crs-customization) section.

The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification supports only affine transformations for geo-referencing; other transformations, such as polynomial transformations, are not supported.

The grid coordinates of a raster are always anchored at the center of grid cells. The translation terms of the affine transformation, `ip_x` and `ip_y`, therefore designate the world coordinates of the center of the upper-left grid cell.
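
The mapping from grid coordinates to world coordinates can be sketched as follows. The term order (which skew factor multiplies the column and which the row) is an assumption borrowed from the PostGIS/GDAL convention; this specification does not spell it out. The `GeoTransform` class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoTransform:
    """Affine geo-referencing parameters of a raster value."""

    ip_x: float
    ip_y: float
    scale_x: float
    scale_y: float
    skew_x: float
    skew_y: float

    def grid_to_world(self, col: float, row: float) -> tuple[float, float]:
        """Map a grid coordinate (cell center) to a world coordinate.

        Assumption: PostGIS/GDAL-style term order, not confirmed by the spec.
        """
        x = self.ip_x + self.scale_x * col + self.skew_x * row
        y = self.ip_y + self.skew_y * col + self.scale_y * row
        return (x, y)


# Hypothetical north-up raster with 10-unit cells (negative scale_y: rows run south).
gt = GeoTransform(
    ip_x=500000.0, ip_y=4100000.0,
    scale_x=10.0, scale_y=-10.0,
    skew_x=0.0, skew_y=0.0,
)
upper_left_center = gt.grid_to_world(0, 0)  # equals (ip_x, ip_y) by definition
other_center = gt.grid_to_world(2, 3)       # column 2, row 3
```

Because grid coordinates are anchored at cell centers, grid coordinate `(0, 0)` maps exactly to `(ip_x, ip_y)`.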

This specification supports persisting raster band values in two different ways specified by the `isOffline` flag in the band data encoding. The two options are:

* **in-db**: The band values are stored in the same Parquet file as the geo-referencing information.
* **out-db**: The band values are stored in files external to the Parquet file.

### Band data encoding

| Name | Type | Meaning |
|------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `isOffline` | 1 bit | If true, data is found on external storage, through the URI given in the out-db pixel data encoding. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
Collaborator Author

This naively replaces the reserved field of WKB raster: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat#L86

Open to better ideas!

Contributor

Having just a single bit for the compression method is not very future proof ? What about a full byte with an enumeration with just NONE and GZIP for now ?

Contributor

but what's the point of GZIP'ping given that Parquet natively supports GZIP ?

@migurski migurski Apr 29, 2025

Our customers typically upload these files to data warehouses where their uncompressed sizes are counted against user quotas. Compressing inside the data will allow customer table sizes to match file sizes more closely for sparse or indexed bands.

Collaborator

It would be good to measure this, but I believe that most Parquet readers have to decompress whatever they read in pages of some number of values (I think thousands is the default but I'm not sure if this adapts to the number of bytes in them). Having some control over which values are decompressed could possibly be very important when reading (particularly when reading and then filtering); however, this has come up a number of times and so I think we need to quantify this for the next person who asks 🙂

| `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float |
Contributor

There should be a provision for 16-bit float. Cf https://gdal.org/en/latest/development/rfc/rfc100_float16_support.html
What about 64-bit signed/unsigned integers? They are a bit esoteric, but supported by GDAL

Contributor

If 1, 2, 4-bit integers are encoded on a full byte, perhaps pixtype should be decomposed in two separate fields: one to indicate the nature (signed integer, unsigned integer, IEEE floating-point) and another one the bit width. That way this could preserve a metadata information for e.g. 12-bit unsigned rasters.

| `nodata` | 1 to 8 bytes (depending on `pixtype` [1]) | Nodata value. |
| `data` | byte_array | Raster band pixel data (see below). |
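
As an illustration of the flag and `pixtype` fields above, here is a minimal sketch that unpacks them from a single byte. The bit positions are an assumption: they mirror the PostGIS WKB raster band header, with `isGZIPPed` taking the place of the reserved bit. This specification does not fix a bit layout, and `decode_band_flags` and `BandFlags` are hypothetical names.

```python
from typing import NamedTuple

# Pixel type codes listed in the band data encoding table (code 9 is not defined).
PIXTYPES = {
    0: "1-bit boolean",
    1: "2-bit unsigned integer",
    2: "4-bit unsigned integer",
    3: "8-bit signed integer",
    4: "8-bit unsigned integer",
    5: "16-bit signed integer",
    6: "16-bit unsigned integer",
    7: "32-bit signed integer",
    8: "32-bit unsigned integer",
    10: "32-bit float",
    11: "64-bit float",
}


class BandFlags(NamedTuple):
    is_offline: bool
    has_nodata_value: bool
    is_nodata_value: bool
    is_gzipped: bool
    pixtype: int


def decode_band_flags(byte: int) -> BandFlags:
    """Split one header byte into four 1-bit flags and a 4-bit pixtype.

    Assumed layout (not confirmed by the spec): bit 7 isOffline, bit 6
    hasNodataValue, bit 5 isNodataValue, bit 4 isGZIPPed, bits 0-3 pixtype.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError(f"expected a single byte, got {byte}")
    pixtype = byte & 0x0F
    if pixtype not in PIXTYPES:
        raise ValueError(f"undefined pixtype code: {pixtype}")
    return BandFlags(
        is_offline=bool(byte & 0x80),
        has_nodata_value=bool(byte & 0x40),
        is_nodata_value=bool(byte & 0x20),
        is_gzipped=bool(byte & 0x10),
        pixtype=pixtype,
    )


flags = decode_band_flags(0x44)  # hasNodataValue set, pixtype 4
```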

### In-DB pixel data encoding

This encoding is used when `isOffline` flag is false.

| Name | Type | Meaning |
|--------------|-----------------|---------|
| `pix[w*h]` | 1 to 8 bytes (depending on `pixtype` [1]) | Pixel values, row after row. `pix[0]` is the upper-left, `pix[w-1]` is the upper-right. <br><br>Endianness is specified by the `endianness` field of the raster value. It is implicit up to 8 bits (bit-order is most significant first). |
Contributor

"bit-order is most significant first" not understanding this. For 1-bit data, does that mean that value 1 is encoded as (1 << 7) ?
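
For reference, the in-DB pixel layout in the table above (row after row, `pix[0]` at the upper-left) can be read back with a sketch like the one below. It assumes the bytes are not GZIP-compressed and contain exactly `width * height` values. Following note [1], the 1-, 2- and 4-bit types are read as one unsigned byte per value, and the bit-order question raised in this thread is left open. `decode_pixels` and `PIXTYPE_FORMATS` are hypothetical names.

```python
import struct

# pixtype code -> struct format character.
# Sub-byte types (codes 0, 1, 2) occupy one full byte per value, per note [1].
PIXTYPE_FORMATS = {
    0: "B", 1: "B", 2: "B",
    3: "b", 4: "B",
    5: "h", 6: "H",
    7: "i", 8: "I",
    10: "f", 11: "d",
}


def decode_pixels(data: bytes, width: int, height: int,
                  pixtype: int, little_endian: bool) -> list[list]:
    """Decode uncompressed in-DB pixel bytes into a list of rows."""
    fmt_char = PIXTYPE_FORMATS[pixtype]
    # "<" and ">" select standard sizes, so the result is platform independent.
    prefix = "<" if little_endian else ">"
    count = width * height
    expected = struct.calcsize(prefix + fmt_char) * count
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    flat = struct.unpack(f"{prefix}{count}{fmt_char}", data)
    # Row-major order: row r holds values r*width .. (r+1)*width - 1.
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


# A 3x2 band of 16-bit unsigned integers (pixtype 6), little endian.
sample = struct.pack("<6H", 1, 2, 3, 4, 5, 6)
rows = decode_pixels(sample, width=3, height=2, pixtype=6, little_endian=True)
```

Here `rows[0][0]` is the upper-left pixel and `rows[0][width - 1]` the upper-right one.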


### Out-DB pixel data encoding

This encoding is used when `isOffline` flag is true.

| Name | Type | Meaning |
|--------------|-----------|-------------------------------------------------------------------------|
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files) |

The allowed URI schemes are:
Collaborator Author

Putting some obvious schemes here. Open to new ideas!

* `file://`: Local file system
* `http://`: HTTP
* `https://`: HTTPS
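
A reader might check an out-db `url` against this list before trying to open it. The sketch below uses Python's standard `urllib.parse`; the helper name `is_allowed_outdb_uri` and the sample URIs are hypothetical.

```python
from urllib.parse import urlsplit

# URI schemes listed as allowed for out-db raster files.
ALLOWED_SCHEMES = frozenset({"file", "http", "https"})


def is_allowed_outdb_uri(uri: str) -> bool:
    """Return True if the URI uses one of the allowed schemes."""
    # urlsplit lowercases the scheme; a bare path yields an empty scheme.
    return urlsplit(uri).scheme in ALLOWED_SCHEMES


ok = is_allowed_outdb_uri("https://example.com/tiles/scene.tif")
rejected = is_allowed_outdb_uri("s3://bucket/scene.tif")
```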

---

[1] Note: 1, 2, and 4 bit `pixtype`s are still encoded as 1 byte per value.

### CRS Customization

CRS is represented as a string value. Writer and reader implementations are
responsible for serializing and deserializing the CRS, respectively.

As a convention to maximize interoperability, custom CRS values can be
specified by a string of the format `type:identifier`, where `type` is one of
the following values:

* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
Contributor

SRID by itself doesn't mean much if you don't point to a spatial_ref_sys table...

We discussed this and determined that common EPSG values are the most likely ones to appear here. A user can choose to reference a particular table, but in general values like 4326, 3857, or 32610 could be interpreted by convention.

Another consideration is EPSG codes aren't necessarily stably defined over time. As an implementer, I would find the proposed approach frustrating because the definition of 4326 could be an ID or the PROJJSON that defines 4326. The latter might be more precisely defined than the former.

One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage. Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition? I would try to make the case that just PROJJSON is good enough. Plenty of discussion on the topic in GeoParquet's repo.

Collaborator

I'm definitely sympathetic to inlining PROJJSON into each value because that's what I want for the world (pun slightly intended). If we do that, we should quantify the overhead we're asking for (a naive generation of PROJJSON for EPSG:3857 gives me 2200 bytes, a 512x512 png with 10 points is 10,000 bytes, and with 1000 points it's 100000 bytes. 2% overhead is probably acceptable but 20% overhead seems worth offering an alternative for?). PROJJSON in file metadata seems like reasonable middle ground?

Member

> A user can choose to reference a particular table

How would they reference a table? I don't see a way to do that in the spec. Last time we got deep into this I did wonder if we should just publish a table of what PostGIS does somewhere stable and then reference it. All the database systems (and db-centered formats like Geopackage and Iceberg) work fine with SRID's since it's easy for them to throw in an extra table somewhere. But with parquet we don't have that luxury, and I think just going 'by convention' isn't actually robust enough - we need to define that convention somewhere.

> but in general values like 4326, 3857, or 32610 could be interpreted by convention.

If we don't want to define a full table somewhere then I do think we should define these common ones in the spec, so everyone is on the same page as to what they are. I agree that the majority of data is covered by a small number of these, so I like the idea of making the majority case as easy as possible.

But I also do like things to be unambiguous, giving implementors just one route, so do lean a bit towards just picking one.

Member

> One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage.

I always liked this idea, but unfortunately I don't know of any implementations that are actually doing this, and when I've tried to encourage people (like Java people or Esri), it hasn't been compelling at all to them. Maybe we could do more to explain it within the spec, like lay out the common ones and what people should do...

> Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition?

I don't regret doing PROJJSON in GeoParquet, but it's clearly the decision that's given everyone the most heartache (though the alternatives all seemed less good). A complete PROJJSON here is less of a slam dunk to me, as all the other metadata isn't already JSON, which was the case in GeoParquet. And I am sympathetic to just doing something really simple for the most common CRSs, so people can just glance at the field and understand what it is.

Member

> PROJJSON in file metadata seems like reasonable middle ground?

Are you saying that projjson would be in the file metadata as the default, and then at the row level people could override that?

The satellite imagery use case requires us to have some row level CRS...

Classic problem that's been cycling for a long time

Like, having a binary geometry include the CRS so every (row) geometry is self-contained is what EWKB(?), and the old Manifold binary, it's the same as an encoded (tile/chunk/row/array), I think we're post-selfcontained era, that was for spitting gis data around the internet, we wouldn't require a tile(chunk/row/array) to represent its bbox/transform or position in a larger logical array. These are container (layer) metadata

Collaborator

Not sure I have the full answer here. It may be that if srid is stored in native Parquet that the Parquet compression will take care of most of the overhead (or it may even be dictionary-encoded automatically!).

* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
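
A reader could recognize the `type:identifier` convention with a sketch like the following. Returning `None` for strings that do not follow the convention is an assumption, since the section describes the format as a convention rather than a requirement. The function name `parse_crs` and the property name `my_crs_property` are hypothetical.

```python
from typing import Optional

# CRS types named by the convention in this section.
KNOWN_CRS_TYPES = ("srid", "projjson")


def parse_crs(value: str) -> Optional[tuple[str, str]]:
    """Split a `type:identifier` CRS string.

    Returns (type, identifier), or None when the value does not follow the
    convention and must be interpreted by the reader in some other way.
    """
    crs_type, sep, identifier = value.partition(":")
    if not sep or crs_type not in KNOWN_CRS_TYPES or not identifier:
        return None
    return (crs_type, identifier)


by_srid = parse_crs("srid:4326")
by_property = parse_crs("projjson:my_crs_property")
```

For `projjson`, the identifier is only the name of the table or file property; looking up the PROJJSON string stored there is left to the reader implementation.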


## Metadata

Parquet Raster files include additional metadata at two levels:

1. File metadata indicating things like the version of this specification used
2. Column metadata with additional metadata for each raster column

### File metadata

| Field Name | Type | Description |
| ------------------ | ------ |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| version | string | **REQUIRED.** The version identifier for the Parquet Raster specification. |
| primary_column | string | **REQUIRED.** The name of the "primary" raster column. In cases where a Parquet file contains multiple raster columns, the primary raster may be used by default in raster operations. |
| columns | object\<string, [Column Metadata](#column-metadata)> | **REQUIRED.** Metadata about raster columns. Each key is the name of a raster column in the table. |

At this level, additional implementation-specific fields (e.g. library name) MAY be present, and readers should be robust in ignoring those.

### Column metadata

Each raster column in the dataset MUST be included in the `columns` field above with the following content, keyed by the column name:

| Field Name | Type | Description |
|------------| ------------ |------------------------------------------------------------------------------------------|
| geometry | string | **REQUIRED.** Name of the geo-reference column to help accelerate spatial data retrieval |
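
To make the two metadata levels concrete, the sketch below builds a metadata object as a plain Python dictionary and checks the REQUIRED fields from both tables. The version string, the column names, and the rule that `primary_column` must appear in `columns` are assumptions made for illustration; how the metadata is serialized into the Parquet file is not shown.

```python
def validate_raster_metadata(meta: dict, column_names: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for field in ("version", "primary_column", "columns"):
        if field not in meta:
            problems.append(f"missing required field: {field}")

    columns = meta.get("columns", {})
    primary = meta.get("primary_column")
    # Assumption: the primary column should be one of the described raster columns.
    if primary is not None and primary not in columns:
        problems.append(f"primary_column {primary!r} not listed in columns")

    for name, column_meta in columns.items():
        if name not in column_names:
            problems.append(f"raster column {name!r} not found in file")
        geometry = column_meta.get("geometry")
        if geometry is None:
            problems.append(f"column {name!r}: missing required field: geometry")
        elif geometry not in column_names:
            problems.append(f"column {name!r}: geometry column {geometry!r} not found")
    return problems


# Hypothetical file with one raster column and its geo-reference column.
metadata = {
    "version": "0.1.0",  # placeholder, not an official version identifier
    "primary_column": "raster",
    "columns": {"raster": {"geometry": "footprint"}},
}
issues = validate_raster_metadata(metadata, ["id", "raster", "footprint"])
```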