
Add the initial draft of the Parquet Raster standard#259

Merged
cholmes merged 6 commits into opengeospatial:main from jiayuasu:raster
May 6, 2025

Conversation

@jiayuasu (Collaborator):

Parquet Raster aims to define a standard for storing raster imagery data in the Parquet format, making it easier for existing cloud data warehouses to read, write, and exchange raster data at scale.

This effort is initiated by the GeoParquet community, with contributions from companies such as Carto and Wherobots. The specification is largely inspired by the PostGIS WKB Raster encoding [1], and incorporates ideas from existing efforts like Carto’s RaQuet [2] and Wherobots’ Havasu Iceberg specification [3].

This is still a work-in-progress. I am creating this PR to gather feedback. Many people have contributed to the early discussions, including:
@migurski from Carto
@kylebarron from DevSeed
@pramsey from PostGIS
@jiayuasu , @paleolimbot , and @Kontinuation from Wherobots

Key highlights:
1. Raster metadata is stored using Parquet struct types, while pixel data is stored as binary.
2. Supports both in-database (in-DB) and out-of-database (out-DB) storage.
3. A raster column must be accompanied by a geometry column to enable spatial predicate pushdown.
4. Supports optional additional GZIP compression for in-database pixel storage.

[1] WKB raster encoding: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat
[2] Carto RaQuet spec: https://github.com/CartoDB/raquet
[3] Wherobots Havasu Iceberg spec: https://github.com/wherobots/havasu/blob/main/spec.md

Comment thread raster/parquet-raster.md Outdated
| `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
Collaborator Author:

This naively replaces the reserved field of WKB raster: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat#L86

Open to better ideas!

Comment thread raster/parquet-raster.md
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files) |

The allowed URI schemes are:
Collaborator Author:

Putting some obvious schemes here. Open to new ideas!

Comment thread raster/parquet-raster.md Outdated
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
| `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float |
Contributor:

There should be a provision for 16-bit float. Cf. https://gdal.org/en/latest/development/rfc/rfc100_float16_support.html
What about 64-bit signed/unsigned integers? They are a bit esoteric, but supported by GDAL.

Contributor:

If 1-, 2-, and 4-bit integers are encoded on a full byte, perhaps pixtype should be decomposed into two separate fields: one to indicate the nature (signed integer, unsigned integer, IEEE floating point) and another one the bit width. That way the metadata could be preserved for e.g. 12-bit unsigned rasters.
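The decomposition suggested here could look roughly like the following sketch. The enum values, class names, and padding rule are hypothetical, not from the spec:

```python
# Hypothetical split of `pixtype` into a sample "nature" plus a logical bit
# width, so e.g. 12-bit unsigned data keeps its true width even when samples
# are padded to whole bytes in storage.
from dataclasses import dataclass
from enum import IntEnum

class SampleNature(IntEnum):
    UNSIGNED_INT = 0
    SIGNED_INT = 1
    IEEE_FLOAT = 2

@dataclass(frozen=True)
class PixelType:
    nature: SampleNature
    bits: int  # logical bit width of a sample (1, 2, 4, 8, 12, 16, 32, 64)

    @property
    def storage_bytes(self) -> int:
        """Bytes used per sample when sub-byte widths are padded up."""
        return max(1, (self.bits + 7) // 8)

uint12 = PixelType(SampleNature.UNSIGNED_INT, 12)
assert uint12.storage_bytes == 2      # stored on 2 bytes, 12 bits meaningful
float16 = PixelType(SampleNature.IEEE_FLOAT, 16)  # covers the GDAL RFC 100 case
```

This would also make 64-bit integers and 16-bit floats expressible without burning new enum slots for every combination.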

Comment thread raster/parquet-raster.md Outdated
| `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
Contributor:

Having just a single bit for the compression method is not very future-proof. What about a full byte with an enumeration, with just NONE and GZIP for now?

Contributor:

But what's the point of GZIPping, given that Parquet natively supports GZIP?

@migurski commented Apr 29, 2025:

Our customers typically upload these files to data warehouses where their uncompressed sizes are counted against user quotas. Compressing inside the data will allow customer table sizes to match file sizes more closely for sparse or indexed bands.

Collaborator:

It would be good to measure this, but I believe that most Parquet readers have to decompress whatever they read in pages of some number of values (I think thousands is the default but I'm not sure if this adapts to the number of bytes in them). Having some control over which values are decompressed could possibly be very important when reading (particularly when reading and then filtering); however, this has come up a number of times and so I think we need to quantify this for the next person who asks 🙂
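As a rough illustration of the trade-off being discussed, value-level GZIP on a sparse band can be sketched like this. The sizes are illustrative, not a benchmark:

```python
# Value-level GZIP sketch: a mostly-nodata band shrinks dramatically, so the
# *uncompressed* bytes a warehouse meters against quotas stay small even
# before Parquet's own page compression runs.
import gzip

width = height = 256
band = bytearray(width * height)     # mostly nodata (zeros)
for i in range(0, len(band), 1000):  # a few scattered valid pixels
    band[i] = 255

raw = bytes(band)
packed = gzip.compress(raw)
print(len(raw), len(packed))            # sparse data compresses heavily
assert len(packed) < len(raw) // 10
assert gzip.decompress(packed) == raw   # round-trips losslessly
```

The open question from this thread is how this interacts with page-level decompression in real readers, which is exactly what the proposed measurement would quantify.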

Comment thread raster/parquet-raster.md Outdated
specified by a string of the format `type:identifier`, where `type` is one of
the following values:

* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
Contributor:

SRID by itself doesn't mean much if you don't point to a spatial_ref_sys table...

Comment:

We discussed this and determined that common EPSG values are the most likely ones to appear here. A user can choose to reference a particular table, but in general values like 4326, 3857, or 32610 could be interpreted by convention.

Comment:

Another consideration is EPSG codes aren't necessarily stably defined over time. As an implementer, I would find the proposed approach frustrating because the definition of 4326 could be an ID or the PROJJSON that defines 4326. The latter might be more precisely defined than the former.

One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage. Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition? I would try to make the case that just PROJJSON is good enough. Plenty of discussion on the topic in GeoParquet's repo.

Collaborator:

I'm definitely sympathetic to inlining PROJJSON into each value because that's what I want for the world (pun slightly intended). If we do that, we should quantify the overhead we're asking for (a naive generation of PROJJSON for EPSG:3857 gives me 2200 bytes, a 512x512 png with 10 points is 10,000 bytes, and with 1000 points it's 100000 bytes. 2% overhead is probably acceptable but 20% overhead seems worth offering an alternative for?). PROJJSON in file metadata seems like reasonable middle ground?

Member:

> A user can choose to reference a particular table

How would they reference a table? I don't see a way to do that in the spec. Last time we got deep into this I did wonder if we should just publish a table of what PostGIS does somewhere stable and then reference it. All the database systems (and db-centered formats like Geopackage and Iceberg) work fine with SRID's since it's easy for them to throw in an extra table somewhere. But with parquet we don't have that luxury, and I think just going 'by convention' isn't actually robust enough - we need to define that convention somewhere.

> but in general values like 4326, 3857, or 32610 could be interpreted by convention.

If we don't want to define a full table somewhere then I do think we should define these common ones in the spec, so everyone is on the same page as to what they are. I agree that the majority of data is covered by a small number of these, so I like the idea of making the majority case as easy as possible.

But I also do like things to be unambiguous, giving implementors just one route, so do lean a bit towards just picking one.

Member:

> One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage.

I always liked this idea, but unfortunately I don't know of any implementations that are actually doing this, and when I've tried to encourage people (like Java people or Esri), it hasn't been compelling at all to them. Maybe we could do more to explain it within the spec, like lay out the common ones and what people should do...

> Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition?

I don't regret doing PROJJSON in GeoParquet, but it's clearly the decision that's given everyone the most heartache (though the alternatives all seemed less good). A complete PROJJSON here is less of a slam dunk to me, as the rest of the metadata isn't already JSON, which was the case in GeoParquet. And I am sympathetic to just doing something really simple for the most common CRS's, so people can glance at the field and understand what it is.

Member:

> PROJJSON in file metadata seems like reasonable middle ground?

Are you saying that projjson would be in the file metadata as the default, and then at the row level people could override that?

The satellite imagery use case requires us to have some row level CRS...

Comment:

Classic problem that's been cycling for a long time

Comment:

Like, having a binary geometry include the CRS so every (row) geometry is self-contained is what EWKB(?) and the old Manifold binary did, and it's the same for an encoded tile/chunk/row/array. I think we're past the self-contained era; that was for spitting GIS data around the internet. We wouldn't require a tile (chunk/row/array) to represent its bbox/transform or position in a larger logical array. These are container (layer) metadata.

Collaborator:

Not sure I have the full answer here. It may be that if srid is stored in native Parquet that the Parquet compression will take care of most of the overhead (or it may even be dictionary-encoded automatically!).

Comment thread raster/parquet-raster.md
| `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster |
| `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction |
| `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction |
| `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels |
Contributor:

This proposal is restricted to 2D rasters. Is this a conscious choice not to allow n-D rasters such as those permitted by Zarr or netCDF?

Collaborator Author:

+1. We want to allow n-D rasters! We should definitely figure out how to incorporate them in this spec.

Comment thread raster/parquet-raster.md Outdated
A raster is one or more grids of cells. All the grids should have `width` columns and `height` rows. The grid cells are represented by the `band` field. The grids are geo-referenced using an affine transformation that maps the grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field. For more details, please refer to the [CRS Customization](#crs-customization) section.

The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification only supports affine transformations as the geo-referencing transformation; other transformations, such as polynomial transformations, are not supported.

Comment:

Do we need to explicitly define the affine formula to be more clear?

world_x = ip_x + (col + 0.5)*scale_x + (row + 0.5)*skew_x  
world_y = ip_y + (col + 0.5)*skew_y + (row + 0.5)*scale_y
  • col = the column number (pixel index) from the left (0 is the first/leftmost column)
  • row = the row number (pixel index) from the top (0 is the first/topmost row)
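The proposed formula, written out as executable code (assuming 0-based `col`/`row` and the pixel-center `+0.5` convention exactly as stated above):

```python
# Affine grid-to-world mapping as proposed in this comment: the +0.5 offsets
# target the *center* of pixel (col, row), with col/row counted from the
# top-left corner.
def pixel_to_world(col, row, ip_x, ip_y, scale_x, scale_y, skew_x, skew_y):
    world_x = ip_x + (col + 0.5) * scale_x + (row + 0.5) * skew_x
    world_y = ip_y + (col + 0.5) * skew_y + (row + 0.5) * scale_y
    return world_x, world_y

# Center of the top-left pixel of a north-up raster with origin (100, 200)
# and 10-unit pixels (scale_y is negative because Y decreases downward):
assert pixel_to_world(0, 0, 100.0, 200.0, 10.0, -10.0, 0.0, 0.0) == (105.0, 195.0)
```

Note this differs by half a pixel from conventions that map pixel corners, which is exactly the kind of ambiguity an explicit formula in the spec would remove.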

Member:

+1.

We may also want to consider aligning with STAC, see https://github.com/stac-extensions/projection - we just define shape and transform instead of the col, x, etc. Planet's data api did things that way, and the stac way felt like an improvement.

Comment:

+1

Comment:

Or go from 6 double columns to a single double[] array using the GDAL geotransform standard for what each entry in the array is? More opaque to read, but easier to slam into GDAL.
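For reference, GDAL's geotransform array would map onto the spec's named fields roughly as below. This ordering is GDAL's documented convention; the mapping to the spec names is this comment's suggestion, not part of the draft:

```python
# GDAL geotransform order: GT[0]=ip_x, GT[1]=scale_x, GT[2]=skew_x,
# GT[3]=ip_y, GT[4]=skew_y, GT[5]=scale_y.
def to_geotransform(ip_x, ip_y, scale_x, scale_y, skew_x, skew_y):
    return [ip_x, scale_x, skew_x, ip_y, skew_y, scale_y]

def apply_geotransform(gt, col, row):
    """GDAL convention: maps the *top-left corner* of pixel (col, row)."""
    return (gt[0] + col * gt[1] + row * gt[2],
            gt[3] + col * gt[4] + row * gt[5])

gt = to_geotransform(ip_x=100.0, ip_y=200.0, scale_x=10.0, scale_y=-10.0,
                     skew_x=0.0, skew_y=0.0)
assert apply_geotransform(gt, 0, 0) == (100.0, 200.0)
```

The single-array form can be passed straight to `gdal.Dataset.SetGeoTransform`, which is the "easier to slam into GDAL" upside; the downside, as noted, is that readers must memorize the positional order.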

Comment thread format-specs/parquet-raster.md Outdated
@mdsumner commented Apr 28, 2025:

Are these "tiled tables", i.e. encoded arrays in blob columns? I don't get that clarity from this document, though it is in the WKB doc; I think it should be described here: what records the pixel orientation and arrangement? I also think it would be great to describe related schemes like GeoPackage, PostGIS, manifold.net, probably TileDB, and Zarr (materialized or virtual). I don't really get the arrangement of pixel values yet, but would love to help couch this in those broader terms.

@mdsumner commented Apr 29, 2025:

FWIW I've mostly answered my questions by exploring the structure in the RaQuet examples: https://github.com/CartoDB/raquet/tree/master/examples. It's certainly straightforward to unpack the blocks from the Arrow memory and see how that works (I'm going to work harder to follow @rouault's and others' suggestions here). One thing that would be amazing is to generalize, as virtual Zarr has, to encoded arrays that are referenced in the table but can live "anywhere", whether in traditional Zarr chunks or within legacy files (or anything). I appreciate that's probably way out of scope here, but that seems to be the opportunity ahead of us generally. (Thanks!)

Comment thread format-specs/parquet-raster.md Outdated
| Name | Type | Meaning |
|--------------|-----------|-------------------------------------------------------------------------|
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files) |
Member:

I think we should flesh out 'e.g., GeoTIFF files' a lot more. What types of raster data do we expect clients to support? What happens when someone puts in a value that the client doesn't support? Do we expect them to figure out if they can read it from the file name? Or are they expected to try to open it?

In STAC we don't just use a single url field; we have an object that also has a 'media type', see https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#working-with-media-types for info on how we use it. Clients can make use of that to figure out if they support it. We don't try to list all potential media types, but we share the common ones. An approach like that could potentially work here, but we'd need another field.

@paleolimbot (Collaborator) left a comment:

We chatted about this a bit at the sync on Monday, but I wonder if there is a way to reduce the fields in the Parquet spec to purely what would be queried and inline the rest into either the binary representation or some JSON metadata. For argument's sake:

  • Other top-level Parquet fields
  • raster: struct
    • covering: geometry
    • encoding: string (one of more well-thought-out versions of 'uri', 'wkbraster', 'gzippedwkbraster', 'cartosnumpything')
    • metadata: string (JSON with crs, shape, and anything else that might be required to implement deferred loading according to what you've specified here)
    • data: binary (depends on encoding)

I'm sure the right answer is somewhere between putting everything into a fixed schema like this and expanding it all into Parquet...just thought I'd throw that out there!

* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.


## Metadata
Collaborator:

Is any of this essential? File-level metadata can be difficult to incorporate into a Parquet reader/writer depending on the API it exposes, and if it's not required I wonder if we should omit it to simplify reading and writing.

@jiayuasu (Collaborator, Author) commented May 5, 2025:

I think none of them are essential. The idea of this file-level metadata is for geo-aware Parquet readers that actually understand geo and raster, rather than generic Parquet readers. These tools can do better optimization if they can detect the Parquet Raster column.

Generic Parquet readers will work fine without knowing this column is a so-called raster type.


A raster column MUST be stored as a `struct` type column in parquet files. The `struct` type MUST contain the fields defined in the following table. The `raster` column MUST be stored in the root level of the parquet file.

Each raster column must also have a corresponding `Geometry` or `Geography` type column, stored in the top level of the parquet file. The name of the geometry column MUST be specified in the `geometry` field of the raster column metadata.
Collaborator:

I wonder if this would be just as effective (or more) if included in the struct. The statistics should also be available there and keeping it together is helpful when wrapping a file reader/writer interface (some of them are made easier if there's a 1:1 between the source/destination datatype and the parquet datatype)

@cholmes (Member) commented Apr 30, 2025:

> I wonder if there is a way to reduce the fields in the Parquet spec to purely what would be queried and inline the rest into either the binary representation or some JSON metadata.

+1 - I liked your suggestion of considering whether people actually need to query on it, and if not then it doesn't need to be in the parquet.

@cholmes (Member) commented May 4, 2025:

What do others think about just merging this pretty soon, and then additional discussions could take place as individual PR's, instead of just having tons of little comment threads? Like just mark it as work in progress at the top of the doc.

@jiayuasu jiayuasu changed the title [WIP] Create Parquet Raster standard Add the initial draft of the Parquet Raster standard May 5, 2025
@jiayuasu (Collaborator, Author) commented May 5, 2025:

@cholmes:

> What do others think about just merging this pretty soon, and then additional discussions could take place as individual PR's, instead of just having tons of little comment threads? Like just mark it as work in progress at the top of the doc.

I agree. Now it is a bit hard to track all the comments in different threads. Let's discuss issues in individual PRs then?

I added "Work in progress" at the top of the doc.

Comment on lines +112 to +115
Parquet Raster files include additional metadata at two levels:

1. File metadata indicating things like the version of this specification used
2. Column metadata with additional metadata for each raster column
Contributor:

We need to at least specify the key for storing this metadata. For instance, "A GeoParquet file MUST include a `geo.raster` key in the Parquet metadata".

Comment thread format-specs/parquet-raster.md Outdated
@cholmes (Member) left a comment:

I'm going to approve this, so we can merge and then move the conversation to individual PR's.

@migurski commented May 5, 2025:

Sounds like the right move!

@jiayuasu jiayuasu marked this pull request as ready for review May 5, 2025 18:06
Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
@paleolimbot (Collaborator) left a comment:

Sounds good!

@jiayuasu (Collaborator, Author) commented May 5, 2025:

After the PR is merged, I will create individual GitHub issues referring to the comments in this PR.

@migurski commented May 5, 2025:

Big topics for followup issues:

  • Compression
  • SRID / CRS

@cholmes (Member) commented May 6, 2025:

Merging this in, as discussed above and at the last zoom meeting. Discussion to continue in individual issues and PR's.

@cholmes cholmes merged commit c159476 into opengeospatial:main May 6, 2025
2 checks passed