Add the initial draft of the Parquet Raster standard #259
cholmes merged 6 commits into opengeospatial:main
Conversation
| `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
This naively replaces the reserved field of WKB raster: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat#L86
Open to better ideas!
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files). |
The allowed URI schemes are:
Putting some obvious schemes here. Open to new ideas!
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
| `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float |
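Since the four flag bits and the 4-bit `pixtype` fit in a single byte, a decoder could look like the sketch below. This is a hypothetical illustration, assuming the PostGIS WKB-raster layout (flags in the high nibble, `pixtype` in the low nibble, with `isGZIPPed` taking the former reserved bit); the draft does not yet pin down the exact bit positions.

```python
# Hypothetical band-header decoder; bit positions follow the PostGIS
# WKB-raster band header, which this draft is modeled on (an assumption).
PIXTYPES = {
    0: "1-bit boolean", 1: "2-bit unsigned integer", 2: "4-bit unsigned integer",
    3: "8-bit signed integer", 4: "8-bit unsigned integer",
    5: "16-bit signed integer", 6: "16-bit unsigned integer",
    7: "32-bit signed integer", 8: "32-bit unsigned integer",
    10: "32-bit float", 11: "64-bit float",
}

def decode_band_header(byte):
    """Split a band-header byte into its flag bits and pixel type."""
    return {
        "isOffline":      bool(byte & 0x80),
        "hasNodataValue": bool(byte & 0x40),
        "isNodataValue":  bool(byte & 0x20),
        "isGZIPPed":      bool(byte & 0x10),  # would replace the WKB 'reserved' bit
        "pixtype":        PIXTYPES[byte & 0x0F],
    }
```

For example, `decode_band_header(0x45)` reports `hasNodataValue` set and a 16-bit signed integer pixel type.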
There should be a provision for 16-bit float. Cf https://gdal.org/en/latest/development/rfc/rfc100_float16_support.html
What about 64-bit signed/unsigned integers? They are a bit esoteric, but supported by GDAL
If 1, 2, 4-bit integers are encoded on a full byte, perhaps pixtype should be decomposed in two separate fields: one to indicate the nature (signed integer, unsigned integer, IEEE floating-point) and another one the bit width. That way this could preserve a metadata information for e.g. 12-bit unsigned rasters.
| `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
Having just a single bit for the compression method is not very future-proof. What about a full byte with an enumeration containing just NONE and GZIP for now?
But what's the point of GZIPping the data given that Parquet natively supports GZIP compression?
Our customers typically upload these files to data warehouses where their uncompressed sizes are counted against user quotas. Compressing inside the data will allow customer table sizes to match file sizes more closely for sparse or indexed bands.
It would be good to measure this, but I believe most Parquet readers have to decompress whatever they read in pages of some number of values (I think thousands is the default, but I'm not sure whether this adapts to the number of bytes in them). Having some control over which values are decompressed could be very important when reading (particularly when reading and then filtering). However, this has come up a number of times, so I think we need to quantify this for the next person who asks 🙂
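The per-band GZIP layer being debated above can be sketched with the standard library alone: band bytes are compressed before being handed to the Parquet writer, and the `isGZIPPed` flag records the choice (the Parquet write itself is omitted here).

```python
import gzip

def pack_band(pixels, use_gzip):
    """Optionally GZIP a band's pixel buffer before it goes into the Parquet
    binary column; the isGZIPPed flag would record whether this was done."""
    return gzip.compress(pixels) if use_gzip else pixels

def unpack_band(data, is_gzipped):
    """Reverse of pack_band, applied after Parquet's own page decompression."""
    return gzip.decompress(data) if is_gzipped else data

# A sparse band (mostly nodata zeros) compresses well:
band = bytes(512 * 512)  # hypothetical 512x512 8-bit band, all zeros
packed = pack_band(band, use_gzip=True)
assert unpack_band(packed, is_gzipped=True) == band
assert len(packed) < len(band)
```

This illustrates the quota argument above: the stored size tracks the compressed size even if the warehouse accounts for uncompressed column bytes.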
specified by a string of the format `type:identifier`, where `type` is one of
the following values:

* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
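A minimal sketch of parsing this `type:identifier` string, assuming the only `type` values are the `srid` and `projjson` options listed in this section (the validation policy for unknown types is my assumption, not something the draft specifies):

```python
# Hypothetical parser for the draft's `type:identifier` CRS strings.
KNOWN_TYPES = {"srid", "projjson"}

def parse_crs_field(value):
    """Split a CRS field into (type, identifier), rejecting unknown types."""
    crs_type, sep, identifier = value.partition(":")
    if not sep or crs_type not in KNOWN_TYPES or not identifier:
        raise ValueError("unrecognized CRS field: %r" % value)
    return crs_type, identifier

assert parse_crs_field("srid:4326") == ("srid", "4326")
assert parse_crs_field("projjson:my_crs_property") == ("projjson", "my_crs_property")
```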
SRID by itself doesn't mean much if you don't point to a spatial_ref_sys table...
Another consideration is EPSG codes aren't necessarily stably defined over time. As an implementer, I would find the proposed approach frustrating because the definition of 4326 could be an ID or the PROJJSON that defines 4326. The latter might be more precisely defined than the former.
One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage. Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition? I would try to make the case that just PROJJSON is good enough. Plenty of discussion on the topic in GeoParquet's repo.
I'm definitely sympathetic to inlining PROJJSON into each value because that's what I want for the world (pun slightly intended). If we do that, we should quantify the overhead we're asking for (a naive generation of PROJJSON for EPSG:3857 gives me 2200 bytes, a 512x512 png with 10 points is 10,000 bytes, and with 1000 points it's 100000 bytes. 2% overhead is probably acceptable but 20% overhead seems worth offering an alternative for?). PROJJSON in file metadata seems like reasonable middle ground?
> A user can choose to reference a particular table
How would they reference a table? I don't see a way to do that in the spec. Last time we got deep into this I did wonder if we should just publish a table of what PostGIS does somewhere stable and then reference it. All the database systems (and db-centered formats like GeoPackage and Iceberg) work fine with SRIDs since it's easy for them to throw in an extra table somewhere. But with Parquet we don't have that luxury, and I think just going 'by convention' isn't actually robust enough - we need to define that convention somewhere.
> but in general values like 4326, 3857, or 32610 could be interpreted by convention.
If we don't want to define a full table somewhere then I do think we should define these common ones in the spec, so everyone is on the same page as to what they are. I agree that the majority of data is covered by a small number of these, so I like the idea of making the majority case as easy as possible.
But I also do like things to be unambiguous, giving implementors just one route, so do lean a bit towards just picking one.
> One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage.
I always liked this idea, but unfortunately I don't know of any implementations that are actually doing this, and when I've tried to encourage people (like Java people or Esri), it hasn't been compelling at all to them. Maybe we could do more to explain it within the spec, like lay out the common ones and what people should do...
> Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition?
I don't regret doing PROJJSON in GeoParquet, but it's clearly the decision that's given everyone the most heartache (though the alternatives all seemed less good). I do think a complete PROJJSON here is less of a slam dunk to me, as all the other metadata isn't already JSON, which was the case in GeoParquet. And I am sympathetic to just doing something really simple for the most common CRSs, so people can just glance at the field and understand what it is.
> PROJJSON in file metadata seems like reasonable middle ground?
Are you saying that projjson would be in the file metadata as the default, and then at the row level people could override that?
The satellite imagery use case requires us to have some row level CRS...
Classic problem that's been cycling for a long time
Having a binary geometry include the CRS so every (row) geometry is self-contained is what EWKB (and the old Manifold binary) did, and it's the same for an encoded tile/chunk/row/array. I think we're past the self-contained era; that was for shipping GIS data around the internet. We wouldn't require a tile (chunk/row/array) to represent its bbox/transform or its position in a larger logical array. These are container (layer) metadata.
Not sure I have the full answer here. It may be that if srid is stored in native Parquet that the Parquet compression will take care of most of the overhead (or it may even be dictionary-encoded automatically!).
| `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster |
| `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction |
| `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction |
| `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels |
This proposal is restricted to 2D rasters. Is this a conscious choice not to allow n-D rasters such as those permitted by Zarr or netCDF?
+1. We want to allow n-D rasters! We should definitely figure out how to incorporate them in this spec.
A raster is one or more grids of cells. All the grids should have `width` columns and `height` rows. The grid cells are represented by the `band` field. The grids are geo-referenced using an affine transformation that maps the grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field. For more details, please refer to the [CRS Customization](#crs-customization) section.
The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification only supports affine geo-referencing transformations; other transformations, such as polynomial transformations, are not supported.
Do we need to explicitly define the affine formula to be more clear?
world_x = ip_x + (col + 0.5)*scale_x + (row + 0.5)*skew_x
world_y = ip_y + (col + 0.5)*skew_y + (row + 0.5)*scale_y
- col = the column number (pixel index) from the left (0 is the first/leftmost column)
- row = the row number (pixel index) from the top (0 is the first/topmost row)
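The formula proposed above translates directly into a small helper. This is a sketch of the commenter's proposal (pixel-center convention via the `+0.5` offsets), not yet part of the spec; the parameter names come from the spec's affine fields.

```python
def grid_to_world(col, row, ip_x, ip_y, scale_x, scale_y, skew_x, skew_y):
    """Map a grid cell (col, row) to world coordinates using the six affine
    parameters; +0.5 targets the pixel center, per the proposed convention."""
    world_x = ip_x + (col + 0.5) * scale_x + (row + 0.5) * skew_x
    world_y = ip_y + (col + 0.5) * skew_y + (row + 0.5) * scale_y
    return world_x, world_y

# With no skew, the top-left pixel center of a 10 m grid anchored at (100, 200)
# lands half a pixel in from the corner (north-up rasters have negative scale_y):
assert grid_to_world(0, 0, 100.0, 200.0, 10.0, -10.0, 0.0, 0.0) == (105.0, 195.0)
```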
+1.
We may also want to consider aligning with STAC, see https://github.com/stac-extensions/projection - they just define `shape` and `transform` instead of the individual col, x, etc. fields. Planet's data API did things that way, and the STAC way felt like an improvement.
Or go from 6 double columns to a single double[] array using the GDAL geotransform standard for what each entry in the array is? More opaque to read, but easier to slam into GDAL.
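Converting between the spec's six named fields and the GDAL geotransform array suggested above is a trivial reordering; the sketch below assumes GDAL's documented ordering (origin x, pixel width, row rotation, origin y, column rotation, pixel height).

```python
def to_gdal_geotransform(ip_x, ip_y, scale_x, scale_y, skew_x, skew_y):
    """Pack the spec's named fields into GDAL geotransform order:
    [origin_x, pixel_width, row_rotation, origin_y, column_rotation, pixel_height]."""
    return [ip_x, scale_x, skew_x, ip_y, skew_y, scale_y]

def from_gdal_geotransform(gt):
    """Unpack a GDAL geotransform back into the spec's named fields."""
    return {"ip_x": gt[0], "scale_x": gt[1], "skew_x": gt[2],
            "ip_y": gt[3], "skew_y": gt[4], "scale_y": gt[5]}

gt = to_gdal_geotransform(100.0, 200.0, 10.0, -10.0, 0.0, 0.0)
assert gt == [100.0, 10.0, 0.0, 200.0, 0.0, -10.0]
assert from_gdal_geotransform(gt)["scale_y"] == -10.0
```

Keeping six named columns is friendlier to SQL predicates; a single `double[6]` column is friendlier to GDAL interop. The helpers show the two layouts are losslessly interchangeable either way.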
Are these "tiled tables", i.e. encoded arrays in blob columns? I don't get that clarity from this document; though it is in the WKB doc, I think it should be described here: what records the pixel orientation and arrangement? I also think it would be great to describe related schemes like GeoPackage, PostGIS, manifold.net, probably TileDB, and Zarr (materialized or virtual). I don't really get the arrangement of pixel values yet, but would love to help couch this in those broader terms.
FWIW I've answered most of my questions by exploring the structure in the RaQuet examples: https://github.com/CartoDB/raquet/tree/master/examples. It's certainly straightforward to unpack the blocks from the Arrow memory and see how that works (I'm going to work harder to follow @rouault's and others' suggestions here). One thing I think would be amazing would be to generalize, like virtual Zarr has, to encoded arrays that are referenced in the table but can live "anywhere", either in traditional Zarr chunks or within legacy files (or anything). I appreciate that's probably way out of scope here, but that seems to be the opportunity ahead of us generally. (Thanks!)
| Name | Type | Meaning |
|--------------|--------|-------------------------------------------------------------------------|
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files). |
I think we should flesh out 'e.g., GeoTIFF files' a lot more. What types of raster data do we expect clients to support? What happens when someone puts in a value that the client doesn't support? Do we expect them to figure out if they can read it from the file name? Or are they expected to try to open it?
In STAC we don't just use a single url field; we have an object that also has a 'media type', see https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#working-with-media-types for info on how we use it. Clients can make use of that to figure out whether they support it. We don't try to list all potential media types, but we share the common ones. An approach like that could potentially work here, but we'd need another field.
paleolimbot left a comment
We chatted about this a bit at the sync on Monday, but I wonder if there is a way to reduce the fields in the Parquet spec to purely what would be queried and inline the rest into either the binary representation or some JSON metadata. For argument's sake:
- Top-level Parquet fields:
  - raster: struct
    - covering: geometry
    - encoding: string (one of more well-thought-out versions of 'uri', 'wkbraster', 'gzippedwkbraster', 'cartosnumpything')
    - metadata: string (JSON with crs, shape, and anything else that might be required to implement deferred loading according to what you've specified here)
    - data: binary (depends on encoding)
I'm sure the right answer is somewhere between putting everything into a fixed schema like this and expanding it all into Parquet...just thought I'd throw that out there!
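The reduced layout sketched in this comment could look like the following per row. All field names (`covering`, `encoding`, `metadata`, `data`) come from the comment above; the concrete values are hypothetical placeholders, not part of any spec.

```python
import json

# One hypothetical row under the reduced schema: everything queryable stays
# as Parquet columns, the rest rides in an opaque JSON string plus a blob.
row = {
    "raster": {
        "covering": b"\x01\x03...",      # WKB polygon footprint (truncated placeholder)
        "encoding": "gzippedwkbraster",  # one of the enumerated encodings
        "metadata": json.dumps({
            "crs": "srid:4326",          # CRS moved into the JSON blob
            "shape": [512, 512],
        }),
        "data": b"...",                  # band bytes, interpreted per 'encoding'
    }
}

meta = json.loads(row["raster"]["metadata"])
assert meta["shape"] == [512, 512]
```

The trade-off this makes concrete: fields inside `metadata` cannot be pushed down as Parquet predicates, which is why only `covering` and `encoding` remain as real columns.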
* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
## Metadata
Is any of this essential? File-level metadata can be difficult to incorporate into a Parquet reader/writer depending on the API it exposes, and if it's not required I wonder if we should omit it to simplify reading and writing.
I think none of them are essential. The idea of these file-level metadata is for geo-aware Parquet readers who actually understand geo and raster rather than generic Parquet readers. These tools can do better optimization if they can detect the Parquet Raster column.
Generic Parquet readers will work fine without knowing this column is a so-called raster type.
A raster column MUST be stored as a `struct` type column in Parquet files. The `struct` type MUST contain the fields defined in the following table. The `raster` column MUST be stored at the root level of the Parquet file.
Each raster column MUST also have a corresponding `Geometry` or `Geography` type column, stored at the top level of the Parquet file. The name of the geometry column MUST be specified in the `geometry` field of the raster column metadata.
I wonder if this would be just as effective (or more) if included in the struct. The statistics should also be available there and keeping it together is helpful when wrapping a file reader/writer interface (some of them are made easier if there's a 1:1 between the source/destination datatype and the parquet datatype)
+1 - I liked your suggestion of considering whether people actually need to query on it, and if not then it doesn't need to be in the Parquet.
What do others think about just merging this pretty soon, and then additional discussions could take place as individual PR's, instead of just having tons of little comment threads? Like just mark it as work in progress at the top of the doc.
I agree. Now it is a bit hard to track all the comments in different threads. Let's discuss issues in individual PRs then? I added
Parquet Raster files include additional metadata at two levels:

1. File metadata indicating things like the version of this specification used
2. Column metadata with additional metadata for each raster column
We need to at least specify the key for storing these metadata. For instance, "A GeoParquet file MUST include a geo.raster key in the Parquet metadata".
cholmes left a comment
I'm going to approve this, so we can merge and then move the conversation to individual PR's.
Sounds like the right move!
Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
After the PR is merged, I will create individual GitHub issues referring to the comments in this PR.
Big topics for follow-up issues:
Merging this in, as discussed above and at the last Zoom meeting. Discussion to continue in individual issues and PRs.
Parquet Raster aims to define a standard for storing raster imagery data in the Parquet format, making it easier for existing cloud data warehouses to read, write, and exchange raster data at scale.
This effort is initiated by the GeoParquet community, with contributions from companies such as Carto and Wherobots. The specification is largely inspired by the PostGIS WKB Raster encoding [1], and incorporates ideas from existing efforts like Carto’s RaQuet [2] and Wherobots’ Havasu Iceberg specification [3].
This is still a work-in-progress. I am creating this PR to gather feedback. Many people have contributed to the early discussions, including:
• @migurski from Carto
• @kylebarron from DevSeed
• @pramsey from PostGIS
• @jiayuasu , @paleolimbot , and @Kontinuation from Wherobots
Key highlights:
1. Raster metadata is stored using Parquet struct types, while pixel data is stored as binary.
2. Supports both in-database (in-DB) and out-of-database (out-DB) storage.
3. A raster column must be accompanied by a geometry column to enable spatial predicate pushdown.
4. Supports optional additional GZIP compression for in-database pixel storage.
[1] WKB raster encoding: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat
[2] Carto RaQuet spec: https://github.com/CartoDB/raquet
[3] Wherobots Havasu Iceberg spec: https://github.com/wherobots/havasu/blob/main/spec.md