Add the initial draft of the Parquet Raster standard #259
cholmes merged 6 commits into opengeospatial:main
Conversation
| `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
This naively replaces the reserved field of WKB raster: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat#L86
Open to better ideas!
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files). |
The allowed URI schemes are:
Putting some obvious schemes here. Open to new ideas!
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
| `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float |
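Since the four flag bits and the 4-bit `pixtype` fit in a single byte, a decoder could look like the sketch below. This is a hypothetical illustration, assuming the PostGIS WKB-raster layout (flags in the high nibble, `pixtype` in the low nibble, with `isGZIPPed` taking the former reserved bit); the draft does not yet pin down the exact bit positions.

```python
# Hypothetical band-header decoder; bit positions follow the PostGIS
# WKB-raster band header, which this draft is modeled on (an assumption).
PIXTYPES = {
    0: "1-bit boolean", 1: "2-bit unsigned integer", 2: "4-bit unsigned integer",
    3: "8-bit signed integer", 4: "8-bit unsigned integer",
    5: "16-bit signed integer", 6: "16-bit unsigned integer",
    7: "32-bit signed integer", 8: "32-bit unsigned integer",
    10: "32-bit float", 11: "64-bit float",
}

def decode_band_header(byte):
    """Split a band-header byte into its flag bits and pixel type."""
    return {
        "isOffline":      bool(byte & 0x80),
        "hasNodataValue": bool(byte & 0x40),
        "isNodataValue":  bool(byte & 0x20),
        "isGZIPPed":      bool(byte & 0x10),  # would replace the WKB 'reserved' bit
        "pixtype":        PIXTYPES[byte & 0x0F],
    }
```

For example, `decode_band_header(0x45)` reports `hasNodataValue` set and a 16-bit signed integer pixel type.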
There should be a provision for 16-bit float. Cf https://gdal.org/en/latest/development/rfc/rfc100_float16_support.html
What about 64-bit signed/unsigned integers? They are a bit esoteric, but supported by GDAL
If 1, 2, 4-bit integers are encoded on a full byte, perhaps pixtype should be decomposed in two separate fields: one to indicate the nature (signed integer, unsigned integer, IEEE floating-point) and another one the bit width. That way this could preserve a metadata information for e.g. 12-bit unsigned rasters.
| `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. |
| `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. |
| `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. |
| `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. |
Having just a single bit for the compression method is not very future-proof. What about a full byte with an enumeration containing just NONE and GZIP for now?
But what's the point of GZIPping the data given that Parquet natively supports GZIP compression?
Our customers typically upload these files to data warehouses where their uncompressed sizes are counted against user quotas. Compressing inside the data will allow customer table sizes to match file sizes more closely for sparse or indexed bands.
It would be good to measure this, but I believe most Parquet readers have to decompress whatever they read in pages of some number of values (I think thousands is the default, but I'm not sure whether this adapts to the number of bytes in them). Having some control over which values are decompressed could be very important when reading (particularly when reading and then filtering). However, this has come up a number of times, so I think we need to quantify this for the next person who asks 🙂
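The per-band GZIP layer being debated above can be sketched with the standard library alone: band bytes are compressed before being handed to the Parquet writer, and the `isGZIPPed` flag records the choice (the Parquet write itself is omitted here).

```python
import gzip

def pack_band(pixels, use_gzip):
    """Optionally GZIP a band's pixel buffer before it goes into the Parquet
    binary column; the isGZIPPed flag would record whether this was done."""
    return gzip.compress(pixels) if use_gzip else pixels

def unpack_band(data, is_gzipped):
    """Reverse of pack_band, applied after Parquet's own page decompression."""
    return gzip.decompress(data) if is_gzipped else data

# A sparse band (mostly nodata zeros) compresses well:
band = bytes(512 * 512)  # hypothetical 512x512 8-bit band, all zeros
packed = pack_band(band, use_gzip=True)
assert unpack_band(packed, is_gzipped=True) == band
assert len(packed) < len(band)
```

This illustrates the quota argument above: the stored size tracks the compressed size even if the warehouse accounts for uncompressed column bytes.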
specified by a string of the format `type:identifier`, where `type` is one of
the following values:

* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself.
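A minimal sketch of parsing this `type:identifier` string, assuming the only `type` values are the `srid` and `projjson` options listed in this section (the validation policy for unknown types is my assumption, not something the draft specifies):

```python
# Hypothetical parser for the draft's `type:identifier` CRS strings.
KNOWN_TYPES = {"srid", "projjson"}

def parse_crs_field(value):
    """Split a CRS field into (type, identifier), rejecting unknown types."""
    crs_type, sep, identifier = value.partition(":")
    if not sep or crs_type not in KNOWN_TYPES or not identifier:
        raise ValueError("unrecognized CRS field: %r" % value)
    return crs_type, identifier

assert parse_crs_field("srid:4326") == ("srid", "4326")
assert parse_crs_field("projjson:my_crs_property") == ("projjson", "my_crs_property")
```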
SRID by itself doesn't mean much if you don't point to a spatial_ref_sys table...
Another consideration is EPSG codes aren't necessarily stably defined over time. As an implementer, I would find the proposed approach frustrating because the definition of 4326 could be an ID or the PROJJSON that defines 4326. The latter might be more precisely defined than the former.
One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage. Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition? I would try to make the case that just PROJJSON is good enough. Plenty of discussion on the topic in GeoParquet's repo.
I'm definitely sympathetic to inlining PROJJSON into each value because that's what I want for the world (pun slightly intended). If we do that, we should quantify the overhead we're asking for (a naive generation of PROJJSON for EPSG:3857 gives me 2200 bytes, a 512x512 png with 10 points is 10,000 bytes, and with 1000 points it's 100000 bytes. 2% overhead is probably acceptable but 20% overhead seems worth offering an alternative for?). PROJJSON in file metadata seems like reasonable middle ground?
> A user can choose to reference a particular table
How would they reference a table? I don't see a way to do that in the spec. Last time we got deep into this I did wonder if we should just publish a table of what PostGIS does somewhere stable and then reference it. All the database systems (and db-centered formats like GeoPackage and Iceberg) work fine with SRIDs since it's easy for them to throw in an extra table somewhere. But with Parquet we don't have that luxury, and I think just going 'by convention' isn't actually robust enough - we need to define that convention somewhere.
> but in general values like 4326, 3857, or 32610 could be interpreted by convention.
If we don't want to define a full table somewhere then I do think we should define these common ones in the spec, so everyone is on the same page as to what they are. I agree that the majority of data is covered by a small number of these, so I like the idea of making the majority case as easy as possible.
But I also do like things to be unambiguous, giving implementors just one route, so do lean a bit towards just picking one.
> One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage.
I always liked this idea, but unfortunately I don't know of any implementations that are actually doing this, and when I've tried to encourage people (like Java people or Esri), it hasn't been compelling at all to them. Maybe we could do more to explain it within the spec, like lay out the common ones and what people should do...
> Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition?
I don't regret doing PROJJSON in GeoParquet, but it's clearly the decision that's given everyone the most heartache (though the alternatives all seemed less good). I do think a complete PROJJSON here is less of a slam dunk to me, as all the other metadata isn't already JSON, which was the case in GeoParquet. And I am sympathetic to just doing something really simple for the most common CRSs, so people can just glance at the field and understand what it is.
> PROJJSON in file metadata seems like reasonable middle ground?
Are you saying that projjson would be in the file metadata as the default, and then at the row level people could override that?
The satellite imagery use case requires us to have some row level CRS...
Classic problem that's been cycling for a long time
Having a binary geometry include the CRS so every (row) geometry is self-contained is what EWKB (and the old Manifold binary) did, and it's the same for an encoded tile/chunk/row/array. I think we're past the self-contained era; that was for shipping GIS data around the internet. We wouldn't require a tile (chunk/row/array) to represent its bbox/transform or its position in a larger logical array. These are container (layer) metadata.
Not sure I have the full answer here. It may be that if srid is stored in native Parquet that the Parquet compression will take care of most of the overhead (or it may even be dictionary-encoded automatically!).
| `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster |
| `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction |
| `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction |
| `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels |
This proposal is restricted to 2D rasters. Is this a conscious choice not to allow n-D rasters such as those permitted by Zarr or netCDF?
+1. We want to allow n-D rasters! We should definitely figure out how to incorporate them in this spec.
A raster is one or more grids of cells. All the grids should have `width` columns and `height` rows. The grid cells are represented by the `band` field. The grids are geo-referenced using an affine transformation that maps the grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field. For more details, please refer to the [CRS Customization](#crs-customization) section.
The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification only supports affine geo-referencing transformations; other transformations, such as polynomial transformations, are not supported.
Do we need to explicitly define the affine formula to be more clear?
world_x = ip_x + (col + 0.5)*scale_x + (row + 0.5)*skew_x
world_y = ip_y + (col + 0.5)*skew_y + (row + 0.5)*scale_y
- col = the column number (pixel index) from the left (0 is the first/leftmost column)
- row = the row number (pixel index) from the top (0 is the first/topmost row)
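The formula proposed above translates directly into a small helper. This is a sketch of the commenter's proposal (pixel-center convention via the `+0.5` offsets), not yet part of the spec; the parameter names come from the spec's affine fields.

```python
def grid_to_world(col, row, ip_x, ip_y, scale_x, scale_y, skew_x, skew_y):
    """Map a grid cell (col, row) to world coordinates using the six affine
    parameters; +0.5 targets the pixel center, per the proposed convention."""
    world_x = ip_x + (col + 0.5) * scale_x + (row + 0.5) * skew_x
    world_y = ip_y + (col + 0.5) * skew_y + (row + 0.5) * scale_y
    return world_x, world_y

# With no skew, the top-left pixel center of a 10 m grid anchored at (100, 200)
# lands half a pixel in from the corner (north-up rasters have negative scale_y):
assert grid_to_world(0, 0, 100.0, 200.0, 10.0, -10.0, 0.0, 0.0) == (105.0, 195.0)
```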
+1.
We may also want to consider aligning with STAC, see https://github.com/stac-extensions/projection - they just define `shape` and `transform` instead of the individual col, x, etc. fields. Planet's data API did things that way, and the STAC way felt like an improvement.
Or go from 6 double columns to a single double[] array using the GDAL geotransform standard for what each entry in the array is? More opaque to read, but easier to slam into GDAL.
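Converting between the spec's six named fields and the GDAL geotransform array suggested above is a trivial reordering; the sketch below assumes GDAL's documented ordering (origin x, pixel width, row rotation, origin y, column rotation, pixel height).

```python
def to_gdal_geotransform(ip_x, ip_y, scale_x, scale_y, skew_x, skew_y):
    """Pack the spec's named fields into GDAL geotransform order:
    [origin_x, pixel_width, row_rotation, origin_y, column_rotation, pixel_height]."""
    return [ip_x, scale_x, skew_x, ip_y, skew_y, scale_y]

def from_gdal_geotransform(gt):
    """Unpack a GDAL geotransform back into the spec's named fields."""
    return {"ip_x": gt[0], "scale_x": gt[1], "skew_x": gt[2],
            "ip_y": gt[3], "skew_y": gt[4], "scale_y": gt[5]}

gt = to_gdal_geotransform(100.0, 200.0, 10.0, -10.0, 0.0, 0.0)
assert gt == [100.0, 10.0, 0.0, 200.0, 0.0, -10.0]
assert from_gdal_geotransform(gt)["scale_y"] == -10.0
```

Keeping six named columns is friendlier to SQL predicates; a single `double[6]` column is friendlier to GDAL interop. The helpers show the two layouts are losslessly interchangeable either way.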
Are these "tiled tables", i.e. encoded arrays in blob columns? I don't get that clarity from this document; though it is in the WKB doc, I think it should be described here: what records the pixel orientation and arrangement? I also think it would be great to describe related schemes like GeoPackage, PostGIS, manifold.net, probably TileDB, and Zarr (materialized or virtual). I don't really get the arrangement of pixel values yet, but would love to help couch this in those broader terms.
FWIW I've answered most of my questions by exploring the structure in the RaQuet examples: https://github.com/CartoDB/raquet/tree/master/examples. It's certainly straightforward to unpack the blocks from the Arrow memory and see how that works (I'm going to work harder to follow @rouault's and others' suggestions here). One thing I think would be amazing would be to generalize, like virtual Zarr has, to encoded arrays that are referenced in the table but can live "anywhere", either in traditional Zarr chunks or within legacy files (or anything). I appreciate that's probably way out of scope here, but that seems to be the opportunity ahead of us generally. (Thanks!)
| Name | Type | Meaning |
|--------------|--------|-------------------------------------------------------------------------|
| `bandNumber` | int8 | 0-based band number to use from the set available in the external file. |
| `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files). |
I think we should flesh out 'e.g., GeoTIFF files' a lot more. What types of raster data do we expect clients to support? What happens when someone puts in a value that the client doesn't support? Do we expect them to figure out if they can read it from the file name? Or are they expected to try to open it?
In STAC we don't just use a single url field; we have an object that also has a 'media type', see https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#working-with-media-types for info on how we use it. Clients can make use of that to figure out whether they support it. We don't try to list all potential media types, but we share the common ones. An approach like that could potentially work here, but we'd need another field.
paleolimbot left a comment
We chatted about this a bit at the sync on Monday, but I wonder if there is a way to reduce the fields in the Parquet spec to purely what would be queried and inline the rest into either the binary representation or some JSON metadata. For argument's sake:
- Top-level Parquet fields:
  - raster: struct
    - covering: geometry
    - encoding: string (one of more well-thought-out versions of 'uri', 'wkbraster', 'gzippedwkbraster', 'cartosnumpything')
    - metadata: string (JSON with crs, shape, and anything else that might be required to implement deferred loading according to what you've specified here)
    - data: binary (depends on encoding)
I'm sure the right answer is somewhere between putting everything into a fixed schema like this and expanding it all into Parquet...just thought I'd throw that out there!
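The reduced layout sketched in this comment could look like the following per row. All field names (`covering`, `encoding`, `metadata`, `data`) come from the comment above; the concrete values are hypothetical placeholders, not part of any spec.

```python
import json

# One hypothetical row under the reduced schema: everything queryable stays
# as Parquet columns, the rest rides in an opaque JSON string plus a blob.
row = {
    "raster": {
        "covering": b"\x01\x03...",      # WKB polygon footprint (truncated placeholder)
        "encoding": "gzippedwkbraster",  # one of the enumerated encodings
        "metadata": json.dumps({
            "crs": "srid:4326",          # CRS moved into the JSON blob
            "shape": [512, 512],
        }),
        "data": b"...",                  # band bytes, interpreted per 'encoding'
    }
}

meta = json.loads(row["raster"]["metadata"])
assert meta["shape"] == [512, 512]
```

The trade-off this makes concrete: fields inside `metadata` cannot be pushed down as Parquet predicates, which is why only `covering` and `encoding` remain as real columns.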
* `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored.
## Metadata
Is any of this essential? File-level metadata can be difficult to incorporate into a Parquet reader/writer depending on the API it exposes, and if it's not required I wonder if we should omit it to simplify reading and writing.
I think none of them are essential. The idea of these file-level metadata is for geo-aware Parquet readers who actually understand geo and raster rather than generic Parquet readers. These tools can do better optimization if they can detect the Parquet Raster column.
Generic Parquet readers will work fine without knowing this column is a so-called raster type.
A raster column MUST be stored as a `struct` type column in Parquet files. The `struct` type MUST contain the fields defined in the following table. The `raster` column MUST be stored at the root level of the Parquet file.
Each raster column MUST also have a corresponding `Geometry` or `Geography` type column, stored at the top level of the Parquet file. The name of the geometry column MUST be specified in the `geometry` field of the raster column metadata.
I wonder if this would be just as effective (or more) if included in the struct. The statistics should also be available there and keeping it together is helpful when wrapping a file reader/writer interface (some of them are made easier if there's a 1:1 between the source/destination datatype and the parquet datatype)
+1 - I liked your suggestion of considering whether people actually need to query on it, and if not then it doesn't need to be in the Parquet.
What do others think about just merging this pretty soon, and then additional discussions could take place as individual PR's, instead of just having tons of little comment threads? Like just mark it as work in progress at the top of the doc.
I agree. Now it is a bit hard to track all the comments in different threads. Let's discuss issues in individual PRs then? I added
Parquet Raster files include additional metadata at two levels:

1. File metadata indicating things like the version of this specification used
2. Column metadata with additional metadata for each raster column
We need to at least specify the key for storing these metadata. For instance, "A GeoParquet file MUST include a geo.raster key in the Parquet metadata".
cholmes left a comment
I'm going to approve this, so we can merge and then move the conversation to individual PR's.
Sounds like the right move!
Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
After the PR is merged, I will create individual GitHub issues referring to the comments in this PR.
Big topics for follow-up issues:
Merging this in, as discussed above and at the last Zoom meeting. Discussion to continue in individual issues and PRs.
Parquet Raster aims to define a standard for storing raster imagery data in the Parquet format, making it easier for existing cloud data warehouses to read, write, and exchange raster data at scale.
This effort is initiated by the GeoParquet community, with contributions from companies such as Carto and Wherobots. The specification is largely inspired by the PostGIS WKB Raster encoding [1], and incorporates ideas from existing efforts like Carto’s RaQuet [2] and Wherobots’ Havasu Iceberg specification [3].
This is still a work-in-progress. I am creating this PR to gather feedback. Many people have contributed to the early discussions, including:
• @migurski from Carto
• @kylebarron from DevSeed
• @pramsey from PostGIS
• @jiayuasu , @paleolimbot , and @Kontinuation from Wherobots
Key highlights:
1. Raster metadata is stored using Parquet struct types, while pixel data is stored as binary.
2. Supports both in-database (in-DB) and out-of-database (out-DB) storage.
3. A raster column must be accompanied by a geometry column to enable spatial predicate pushdown.
4. Supports optional additional GZIP compression for in-database pixel storage.
[1] WKB raster encoding: https://github.com/postgis/postgis/blob/master/raster/doc/RFC2-WellKnownBinaryFormat
[2] Carto RaQuet spec: https://github.com/CartoDB/raquet
[3] Wherobots Havasu Iceberg spec: https://github.com/wherobots/havasu/blob/main/spec.md