Add the initial draft of the Parquet Raster standard #259
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| # Parquet Raster Specification | ||
|
|
||
| ## Overview | ||
|
|
||
| [Apache Parquet](https://parquet.apache.org/) is a standardized, open-source columnar storage format that natively supports geospatial types (i.e., the Geometry and Geography types). The Parquet Raster specification defines how geo-referenced raster imagery data (hereafter "raster") should be stored in Parquet format, including the representation of rasters and the required additional metadata. | ||
|
|
||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.ietf.org/rfc/rfc2119.txt). | ||
|
|
||
| ## Raster columns | ||
|
|
||
| A raster column MUST be stored as a `struct` type column in Parquet files. The `struct` type MUST contain the fields defined in the table under [Raster value](#raster-value). Raster columns MUST be stored at the root level of the Parquet file. | ||
|
|
||
| Each raster column MUST also have a corresponding `Geometry` or `Geography` type column, stored at the root level of the Parquet file. The name of this column MUST be specified in the `geometry` field of the raster column metadata. | ||
|
|
||
| ## Raster Representation | ||
|
|
||
| The raster data model is largely inspired by the WKB raster encoding of PostGIS but extracts the raster metadata out of the binary encoding. | ||
|
|
||
| ### Raster value | ||
|
|
||
| A raster value is composed of the following components: | ||
|
|
||
| | Field | Parquet Physical Type | Parquet Logical Type | Description | | ||
| |--------------|-----------------------|----------------------|-------------------------------------------------------------------------| | ||
| | `endianness` | `BOOLEAN` | | **REQUIRED.** True: little endian; False: big endian | | ||
| | `crs` | `BYTE_ARRAY` | UTF8 | **OPTIONAL.** The coordinate reference system of the raster | | ||
| | `scale_x` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in X direction | | ||
| | `scale_y` | `DOUBLE` | | **REQUIRED.** The scale factor of the raster in Y direction | | ||
| | `ip_x` | `DOUBLE` | | **REQUIRED.** The X coordinate of the upper left corner of the raster | | ||
| | `ip_y` | `DOUBLE` | | **REQUIRED.** The Y coordinate of the upper left corner of the raster | | ||
| | `skew_x` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in X direction | | ||
| | `skew_y` | `DOUBLE` | | **REQUIRED.** The skew factor of the raster in Y direction | | ||
| | `width` | `INT32` | | **REQUIRED.** The width of the raster in pixels | | ||
|
Contributor
This proposal is restricted to 2D rasters. Is this a conscious choice to not allow n-D rasters such as permitted by Zarr or netCDF?
Collaborator
Author
+1. We want to allow n-D rasters! We should definitely figure out how to incorporate them in this spec. |
||
| | `height` | `INT32` | | **REQUIRED.** The height of the raster in pixels | | ||
| | `bands` | `BYTE_ARRAY` | List<BYTE_ARRAY> | **REQUIRED.** The bands of the raster. See the band data encoding below | | ||
|
jiayuasu marked this conversation as resolved.
|
||
|
|
||
| A raster is one or more grids of cells. All the grids should have `height` rows and `width` columns. The grid cells are stored in the `bands` field. The grids are geo-referenced using an affine transformation that maps grid coordinates to world coordinates. The coordinate reference system (CRS) of the world coordinates is specified by the `crs` field. For more details, refer to the [CRS Customization](#crs-customization) section. | ||
|
|
||
| The geo-referencing information is represented by the parameters of an affine transformation (`ip_x`, `ip_y`, `scale_x`, `scale_y`, `skew_x`, `skew_y`). This specification supports only affine transformations for geo-referencing; other transformations, such as polynomial transformations, are not supported. | ||
|
|
||
| The grid coordinates of a raster are always anchored at the center of grid cells. The translation terms of the affine transformation, `ip_x` and `ip_y`, designate the world coordinates of the center of the upper-left grid cell. | ||
|
|
||
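The mapping from grid coordinates to world coordinates described above can be sketched as follows. This is a minimal illustration, not part of the draft: the class name `GeoTransform` and the method `grid_to_world` are hypothetical, and the placement of the skew terms (`skew_x` multiplying the row, `skew_y` multiplying the column) is an assumption borrowed from the PostGIS/GDAL convention, since the draft does not spell out the formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoTransform:
    """Affine geo-referencing parameters of a raster value (hypothetical helper)."""
    ip_x: float
    ip_y: float
    scale_x: float
    scale_y: float
    skew_x: float
    skew_y: float

    def grid_to_world(self, col: float, row: float) -> tuple[float, float]:
        # ASSUMPTION: PostGIS/GDAL-style term placement; the draft does not state it.
        # Grid coordinates are center-anchored, so (0, 0) is the center of the
        # upper-left cell and maps exactly to (ip_x, ip_y).
        x = self.ip_x + self.scale_x * col + self.skew_x * row
        y = self.ip_y + self.skew_y * col + self.scale_y * row
        return x, y


# North-up raster with 10-unit cells (illustrative values only).
gt = GeoTransform(ip_x=100.0, ip_y=200.0, scale_x=10.0, scale_y=-10.0,
                  skew_x=0.0, skew_y=0.0)
center_of_first_cell = gt.grid_to_world(0, 0)
outer_corner = gt.grid_to_world(-0.5, -0.5)
```

Under center anchoring, the outer corner of the upper-left cell sits at grid coordinate `(-0.5, -0.5)`, which is why `outer_corner` differs from `(ip_x, ip_y)` by half a cell.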
| This specification supports persisting raster band values in two ways, selected by the `isOffline` flag in the band data encoding. The two options are: | ||
|
|
||
| * **in-db**: The band values are stored in the same Parquet file as the geo-referencing information. | ||
| * **out-db**: The band values are stored in files external to the Parquet file. | ||
|
|
||
| ### Band data encoding | ||
|
|
||
| | Name | Type | Meaning | | ||
|
jiayuasu marked this conversation as resolved.
Outdated
|
||
| |------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | `isOffline` | 1 bit | If true, data is found on external storage, through the path specified in `RASTERDATA`. | | ||
| | `hasNodataValue` | 1 bit | If true, the stored nodata value is a true nodata value. Otherwise, the nodata value should be ignored. | | ||
| | `isNodataValue` | 1 bit | If true, all values of the band are expected to be nodata values. This is a dirty flag; to set it properly, the function `st_bandisnodata` must be called with `TRUE` as the last argument. | | ||
|
jiayuasu marked this conversation as resolved.
Outdated
|
||
| | `isGZIPPed` | 1 bit | If true, the data is compressed using GZIP before being passed to the Parquet compression process. | | ||
|
Collaborator
Author
This naively replaces the Open to better ideas!
Contributor
Having just a single bit for the compression method is not very future proof? What about a full byte with an enumeration with just NONE and GZIP for now?
Contributor
but what's the point of GZIP'ping given that Parquet natively supports GZIP?
Our customers typically upload these files to data warehouses where their uncompressed sizes are counted against user quotas. Compressing inside the data will allow customer table sizes to match file sizes more closely for sparse or indexed bands.
Collaborator
It would be good to measure this, but I believe that most Parquet readers have to decompress whatever they read in pages of some number of values (I think thousands is the default but I'm not sure if this adapts to the number of bytes in them). Having some control over which values are decompressed could possibly be very important when reading (particularly when reading and then filtering); however, this has come up a number of times and so I think we need to quantify this for the next person who asks 🙂 |
||
| | `pixtype` | 4 bits | Pixel type: <br>0: 1-bit boolean<br>1: 2-bit unsigned integer<br>2: 4-bit unsigned integer<br>3: 8-bit signed integer<br>4: 8-bit unsigned integer<br>5: 16-bit signed integer<br>6: 16-bit unsigned integer<br>7: 32-bit signed integer<br>8: 32-bit unsigned integer<br>10: 32-bit float<br>11: 64-bit float | | ||
|
Contributor
There should be a provision for 16-bit float. Cf https://gdal.org/en/latest/development/rfc/rfc100_float16_support.html
Contributor
If 1, 2, 4-bit integers are encoded on a full byte, perhaps pixtype should be decomposed in two separate fields: one to indicate the nature (signed integer, unsigned integer, IEEE floating-point) and another one the bit width. That way this could preserve a metadata information for e.g. 12-bit unsigned rasters.
jiayuasu marked this conversation as resolved.
Outdated
|
||
| | `nodata` | 1 to 8 bytes (depending on `pixtype` [1]) | Nodata value. | | ||
| | `data` | byte_array | Raster band pixel data (see below). | | ||
|
jiayuasu marked this conversation as resolved.
Outdated
|
||
|
|
||
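The band table above gives field widths but not bit positions. As a rough sketch, the four flags and `pixtype` can be read from a single byte if one assumes the layout of the PostGIS WKB band header, with `isGZIPPed` occupying the bit that PostGIS reserves. That layout is an assumption, not something this draft states, and the names `parse_band_flags` and `PIXTYPE_BYTES` are hypothetical.

```python
# Bytes stored per pixel value for each pixtype code in the table above.
# Per note [1], the 1-, 2- and 4-bit types still occupy one byte per value.
PIXTYPE_BYTES = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 4, 8: 4, 10: 4, 11: 8}


def parse_band_flags(flags: int) -> dict:
    """Split one header byte into band flags and pixtype.

    ASSUMED bit layout (PostGIS-style, not confirmed by the draft):
    bit 7 isOffline, bit 6 hasNodataValue, bit 5 isNodataValue,
    bit 4 isGZIPPed, bits 0-3 pixtype.
    """
    if not 0 <= flags <= 0xFF:
        raise ValueError("band flags must fit in one byte")
    pixtype = flags & 0x0F
    if pixtype not in PIXTYPE_BYTES:
        raise ValueError(f"unknown pixtype code: {pixtype}")
    return {
        "isOffline": bool(flags & 0x80),
        "hasNodataValue": bool(flags & 0x40),
        "isNodataValue": bool(flags & 0x20),
        "isGZIPPed": bool(flags & 0x10),
        "pixtype": pixtype,
        # Also the width of the `nodata` field that follows the flags.
        "value_size": PIXTYPE_BYTES[pixtype],
    }


header = parse_band_flags(0b0100_0101)  # hasNodataValue set, pixtype 5
```

Code 9 is absent from the `pixtype` table, so the sketch rejects it rather than guessing a width.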
| ### In-DB pixel data encoding | ||
|
|
||
| This encoding is used when `isOffline` flag is false. | ||
|
|
||
| | Name | Type | Meaning | | ||
| |--------------|-----------------|---------| | ||
| | `pix[w*h]` | 1 to 8 bytes (depending on `pixtype` [1]) | Pixel values, row after row. `pix[0]` is the upper-left, `pix[w-1]` is the upper-right. <br><br>Byte order is given by the `endianness` field of the raster value. For pixel types of up to 8 bits it is implicit (bit-order is most significant first). | | ||
|
Contributor
"bit-order is most significant first" not understanding this. For 1-bit data, does that mean that value 1 is encoded as (1 << 7)? |
||
|
|
||
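Decoding the in-db layout described above (row after row, `pix[0]` at the upper-left) might look like the sketch below. It assumes the data is not GZIP-compressed and that byte order comes from the raster's `endianness` field. Sub-byte pixel types are returned as the raw byte value, because how their bits are placed within the byte is still an open question in the review thread. The names `decode_pixels` and `PIXTYPE_FORMAT` are hypothetical.

```python
import struct

# struct format character per pixtype code. Sub-byte types (codes 0-2) are read
# as one unsigned byte each, following note [1].
PIXTYPE_FORMAT = {0: "B", 1: "B", 2: "B", 3: "b", 4: "B",
                  5: "h", 6: "H", 7: "i", 8: "I", 10: "f", 11: "d"}


def decode_pixels(data: bytes, pixtype: int, width: int, height: int,
                  little_endian: bool) -> list[list]:
    """Decode uncompressed in-db pixel data into a list of rows."""
    fmt_char = PIXTYPE_FORMAT[pixtype]
    prefix = "<" if little_endian else ">"
    count = width * height
    expected = struct.calcsize(prefix + fmt_char) * count
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    flat = struct.unpack(f"{prefix}{count}{fmt_char}", data)
    # Row-major order: row r holds pix[r*width] .. pix[(r+1)*width - 1].
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


# A 3x2 band of 16-bit unsigned integers (pixtype 6), little endian.
payload = struct.pack("<6H", 1, 2, 3, 4, 5, 6)
rows = decode_pixels(payload, pixtype=6, width=3, height=2, little_endian=True)
```

With this layout, `rows[0][0]` is the upper-left pixel and `rows[0][width - 1]` is the upper-right one.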
| ### Out-DB pixel data encoding | ||
|
|
||
| This encoding is used when `isOffline` flag is true. | ||
|
|
||
| | Name | Type | Meaning | | ||
| |--------------|-----------|-------------------------------------------------------------------------| | ||
| | `bandNumber` | int8 | 0-based band number to use from the set available in the external file. | | ||
| | `url` | string | The URI of the out-db raster file (e.g., GeoTIFF files) | | ||
|
jiayuasu marked this conversation as resolved.
Outdated
|
||
|
|
||
| The allowed URI schemes are: | ||
|
Collaborator
Author
Putting some obvious schemes here. Open to new ideas! |
||
| * `file://`: Local file system | ||
| * `http://`: HTTP | ||
| * `https://`: HTTPS | ||
|
|
||
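A reader could check an out-db `url` against the schemes listed above before attempting to fetch it. The following is a minimal sketch using the standard library; the function name `check_outdb_uri` and the example URIs are hypothetical.

```python
from urllib.parse import urlsplit

# Schemes listed in this draft; the list is still under discussion in review.
ALLOWED_SCHEMES = {"file", "http", "https"}


def check_outdb_uri(uri: str) -> str:
    """Return the URI scheme if the draft allows it, otherwise raise ValueError."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported out-db URI scheme: {scheme!r}")
    return scheme


local = check_outdb_uri("file:///data/scene.tif")
remote = check_outdb_uri("https://example.com/scene.tif")
```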
| --- | ||
|
|
||
| [1] Note: 1, 2, and 4 bit `pixtype`s are still encoded as 1 byte per value. | ||
|
|
||
| ### CRS Customization | ||
|
|
||
| CRS is represented as a string value. Writer and reader implementations are | ||
| responsible for serializing and deserializing the CRS, respectively. | ||
|
|
||
| As a convention to maximize interoperability, custom CRS values can be | ||
| specified by a string of the format `type:identifier`, where `type` is one of | ||
| the following values: | ||
|
|
||
| * `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself. | ||
|
Contributor
SRID by itself doesn't mean much if you don't point to a spatial_ref_sys table...
Another consideration is EPSG codes aren't necessarily stably defined over time. As an implementer, I would find the proposed approach frustrating because the definition of 4326 could be an ID or the PROJJSON that defines 4326. The latter might be more precisely defined than the former. One of the stated design goals of PROJJSON was to make it convenient for applications to pluck the keys it might understand (like a 4326 or 3857) while also providing a complete definition that more sophisticated applications can fully leverage. Lazy mode is always pasting in something that already exists anyway, so why not paste a complete definition? I would try to make the case that just PROJJSON is good enough. Plenty of discussion on the topic in GeoParquet's repo.
Collaborator
I'm definitely sympathetic to inlining PROJJSON into each value because that's what I want for the world (pun slightly intended). If we do that, we should quantify the overhead we're asking for (a naive generation of PROJJSON for EPSG:3857 gives me 2200 bytes, a 512x512 png with 10 points is 10,000 bytes, and with 1000 points it's 100000 bytes. 2% overhead is probably acceptable but 20% overhead seems worth offering an alternative for?). PROJJSON in file metadata seems like reasonable middle ground?
Member
How would they reference a table? I don't see a way to do that in the spec. Last time we got deep into this I did wonder if we should just publish a table of what PostGIS does somewhere stable and then reference it. All the database systems (and db-centered formats like Geopackage and Iceberg) work fine with SRID's since it's easy for them to throw in an extra table somewhere. But with parquet we don't have that luxury, and I think just going 'by convention' isn't actually robust enough - we need to define that convention somewhere.
If we don't want to define a full table somewhere then I do think we should define these common ones in the spec, so everyone is on the same page as to what they are. I agree that the majority of data is covered by a small number of these, so I like the idea of making the majority case as easy as possible. But I also do like things to be unambiguous, giving implementors just one route, so do lean a bit towards just picking one.
Member
I always liked this idea, but unfortunately I don't know of any implementations that are actually doing this, and when I've tried to encourage people (like Java people or Esri), it hasn't been compelling at all to them. Maybe we could do more to explain it within the spec, like lay out the common ones and what people should do...
I don't regret doing PROJJSON in Geoparquet, but it's clearly the decision that's given everyone the most heartache (though the alternatives all seemed less good). I do think a complete projjson here is a less of a slam dunk to me, as all the other metadata isn't already json, which was the case in geoparquet. And I am sympathetic to just doing something really simple for the most common CRS's, so people can just glance at the field and understand what it is.
Member
Are you saying that projjson would be in the file metadata as the default, and then at the row level people could override that? The satellite imagery use case requires us to have some row level CRS...
Classic problem that's been cycling for a long time
Like, having a binary geometry include the CRS so every (row) geometry is self-contained is what EWKB(?), and the old Manifold binary, it's the same as an encoded (tile/chunk/row/array), I think we're post-selfcontained era, that was for spitting gis data around the internet, we wouldn't require a tile(chunk/row/array) to represent its bbox/transform or position in a larger logical array. These are container (layer) metadata
Collaborator
Not sure I have the full answer here. It may be that if |
||
| * `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property or a file property where the projjson string is stored. | ||
|
jiayuasu marked this conversation as resolved.
Outdated
|
||
|
|
||
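Splitting a `crs` string that follows the `type:identifier` convention above could be done as in the sketch below. Because the convention is optional, strings that do not match it are passed over rather than rejected. The function name `parse_crs` and the property name in the example are hypothetical.

```python
from typing import Optional

# CRS types named by the convention in this draft.
KNOWN_CRS_TYPES = ("srid", "projjson")


def parse_crs(value: str) -> Optional[tuple[str, str]]:
    """Return (type, identifier) if value follows the convention, else None."""
    kind, sep, identifier = value.partition(":")
    if not sep or kind not in KNOWN_CRS_TYPES or not identifier:
        # Not a convention-style string; interpretation is left to the implementation.
        return None
    return kind, identifier


by_srid = parse_crs("srid:4326")
by_property = parse_crs("projjson:my_crs_property")  # hypothetical property name
```

Only the first colon is treated as the separator, so an identifier that itself contains a colon is kept intact.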
|
|
||
| ## Metadata | ||
|
|
||
| Parquet Raster files include additional metadata at two levels: | ||
|
|
||
| 1. File metadata indicating things like the version of this specification used | ||
| 2. Column metadata with additional metadata for each raster column | ||
|
|
||
| ### File metadata | ||
|
|
||
| | Field Name | Type | Description | | ||
| | ------------------ | ------ |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | version | string | **REQUIRED.** The version identifier for the Parquet Raster specification. | | ||
| | primary_column | string | **REQUIRED.** The name of the "primary" raster column. In cases where a Parquet file contains multiple raster columns, the primary raster may be used by default in raster operations. | | ||
| | columns | object\<string, [Column Metadata](#column-metadata)> | **REQUIRED.** Metadata about raster columns. Each key is the name of a raster column in the table. | | ||
|
|
||
| At this level, additional implementation-specific fields (e.g. library name) MAY be present, and readers should ignore any fields they do not recognize. | ||
|
|
||
| ### Column metadata | ||
|
|
||
| Each raster column in the dataset MUST be included in the `columns` field above with the following content, keyed by the column name: | ||
|
|
||
| | Field Name | Type | Description | | ||
| |------------| ------------ |------------------------------------------------------------------------------------------| | ||
| | geometry | string | **REQUIRED.** Name of the geo-reference column to help accelerate spatial data retrieval | | ||
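To make the two metadata levels concrete, the sketch below builds a metadata object with the required fields from the tables above and checks it. The draft does not say how this metadata is serialized or under which key it is stored, so it is modelled here as a plain Python dict. The version string, the column names, and the function `validate_file_metadata` are all hypothetical.

```python
REQUIRED_FILE_FIELDS = ("version", "primary_column", "columns")


def validate_file_metadata(meta: dict) -> None:
    """Raise ValueError if required Parquet Raster metadata fields are missing."""
    for key in REQUIRED_FILE_FIELDS:
        if key not in meta:
            raise ValueError(f"missing required file metadata field: {key}")
    columns = meta["columns"]
    # The primary column is itself a raster column, so it needs an entry too.
    if meta["primary_column"] not in columns:
        raise ValueError("primary_column has no entry in columns")
    for name, column in columns.items():
        if not isinstance(column.get("geometry"), str):
            raise ValueError(f"column {name!r} lacks a string 'geometry' field")


# Hypothetical values; only the field names come from the draft.
example_metadata = {
    "version": "0.1.0",
    "primary_column": "raster",
    "columns": {"raster": {"geometry": "footprint"}},
    "library": "example-writer",  # extra implementation-specific field, allowed
}
validate_file_metadata(example_metadata)
```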