
[Postgres] Streaming from/to BYTEA column type #293

Closed

ufoscout opened this issue May 1, 2020 · 7 comments

@ufoscout

ufoscout commented May 1, 2020

Is it possible to insert into and read from a BYTEA column without loading the entire binary content into memory?

For example, I have this table:

create table SAMPLE_BINARY_STORE (
    filename    TEXT NOT NULL,
    data  BYTEA
);

I would like to:

  • INSERT a new row into the table by streaming the data from an async source (e.g. a file on the client filesystem, an HTTP request, etc.)
  • perform a SELECT by filename and stream the data directly into an async destination (e.g. a file or an HTTP response), without loading the entire payload into memory.

Is there a way to achieve this?
Would #36 solve my issue?

Please note that I cannot use Postgres Large Objects.
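
For reference, the fully buffered approach I am trying to avoid looks roughly like this (a sketch assuming a recent sqlx with the query/bind API; exact method names can differ between versions):

use sqlx::postgres::PgPool;
use sqlx::Row;

// Fully buffered write: the whole payload sits in memory as a Vec<u8>.
async fn store(pool: &PgPool, filename: &str, data: Vec<u8>) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT INTO SAMPLE_BINARY_STORE (filename, data) VALUES ($1, $2)")
        .bind(filename)
        .bind(data) // the entire payload is bound as one BYTEA value
        .execute(pool)
        .await?;
    Ok(())
}

// Fully buffered read: the whole BYTEA column is decoded into a Vec<u8>.
async fn load(pool: &PgPool, filename: &str) -> Result<Vec<u8>, sqlx::Error> {
    let row = sqlx::query("SELECT data FROM SAMPLE_BINARY_STORE WHERE filename = $1")
        .bind(filename)
        .fetch_one(pool)
        .await?;
    Ok(row.get("data"))
}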

@abonander
Collaborator

The problem with streaming requests/responses with Postgres is that the protocol documentation explicitly recommends fully buffering messages before sending or processing them (see the Messaging Overview section):

To avoid losing synchronization with the message stream, both servers and clients typically read an entire message into a buffer (using the byte count) before attempting to process its contents. This allows easy recovery if an error is detected while processing the contents. In extreme situations (such as not having enough memory to buffer the message), the receiver can use the byte count to determine how much input to skip before it resumes reading messages.

Conversely, both servers and clients must take care never to send an incomplete message. This is commonly done by marshaling the entire message in a buffer before beginning to send it. If a communications failure occurs partway through sending or receiving a message, the only sensible response is to abandon the connection, since there is little hope of recovering message-boundary synchronization.

Because messages require a byte-length header, we would need to know the exact size of the stream we're reading into Postgres, and more importantly be able to trust that size.
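
To make that concrete: every frontend/backend message is framed as a one-byte type code followed by an Int32 length that counts the length field itself plus the body, so the exact size has to be known before the frame can even be started. A minimal sketch of framing a simple Query message, for illustration only:

fn frame_query(sql: &str) -> Vec<u8> {
    // Byte1('Q') + Int32 length (includes the length field itself) + query text + NUL
    let body_len = sql.len() + 1;
    let mut msg = Vec::with_capacity(1 + 4 + body_len);
    msg.push(b'Q');
    msg.extend_from_slice(&((4 + body_len) as i32).to_be_bytes());
    msg.extend_from_slice(sql.as_bytes());
    msg.push(0);
    msg
}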

I'm all for discussing solutions here, but honestly if your files are too big to buffer in memory then you might consider storing them outside of Postgres; it's not meant for extremely large blobs.

@mehcode
Member

mehcode commented May 1, 2020

I know you mentioned you can't use binary large objects, but do you mind sharing some context? They're explicitly designed for exactly this: streaming huge amounts of data to a single column.

@abonander
Collaborator

The large-object interface would actually be pretty neat to implement as an async interface. It has to be used inside a transaction, so it could be an impl<C> Transaction where C: DerefMut<Target = PgConnection> {}.
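
Purely as an illustration of that shape (every name below is hypothetical; nothing like this exists in sqlx today), it could look something like:

use sqlx::PgConnection;

/// Hypothetical extension the driver could implement for Transaction, since large
/// objects can only be used inside a transaction. Sketch only.
pub trait LargeObjects {
    async fn open_large_object(&mut self, oid: u32) -> sqlx::Result<LargeObject<'_>>;
}

/// Hypothetical handle over the server-side lo_open/loread/lowrite functions,
/// borrowing the transaction's connection.
pub struct LargeObject<'t> {
    conn: &'t mut PgConnection,
    fd: i32, // descriptor that a lo_open call would return
}

impl<'t> LargeObject<'t> {
    /// Read the next chunk (would wrap a loread call).
    pub async fn read_chunk(&mut self, len: i32) -> sqlx::Result<Vec<u8>> {
        unimplemented!("illustrative signature only")
    }

    /// Append a chunk (would wrap a lowrite call).
    pub async fn write_chunk(&mut self, data: &[u8]) -> sqlx::Result<usize> {
        unimplemented!("illustrative signature only")
    }
}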

@ufoscout
Author

ufoscout commented May 9, 2020

do you mind sharing some context?

While discussing a new project with a colleague of mine, he argued that large objects in Postgres should be avoided if possible. Some of his arguments were that:

  • all large objects are stored in a single shared table, so the scalability of the database is limited
  • in some cases large objects are ignored by pg_dump, and they may be excluded from the automatic backups of a managed database (e.g. in gcloud or aws)
  • large objects require extra database maintenance (not sure what/why)
  • large-object performance is not on par with BYTEA

I am not a database administrator, and I am trying to clarify how much truth there is in those claims and whether there is an alternative path.

Having read in the docs that sqlx offers "Row streaming. Data is read asynchronously from the database and decoded on-demand.", I wanted to check whether that could help me stream from/to BYTEA columns.

@abonander
Collaborator

abonander commented May 11, 2020

Streaming output is done on a row-by-row basis since rows arrive in self-contained packets. Anything more granular than that (e.g. for individual columns) would require a significant refactor of the Postgres driver. Streaming input, on top of being a significant refactor as well, would be very hazard-prone: if the source stream is fallible, the only way to recover from an error is to close the connection, since the normal Postgres binary protocol has no provision for canceling incomplete messages.
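
To make that granularity concrete, row-level streaming looks roughly like this today (a sketch assuming a recent sqlx plus futures::TryStreamExt; each yielded row still carries its entire BYTEA value in memory):

use futures::TryStreamExt;
use sqlx::postgres::PgPool;
use sqlx::Row;

async fn dump_all(pool: &PgPool) -> Result<(), sqlx::Error> {
    // Rows arrive from the server one self-contained DataRow packet at a time...
    let mut rows = sqlx::query("SELECT filename, data FROM SAMPLE_BINARY_STORE").fetch(pool);

    while let Some(row) = rows.try_next().await? {
        // ...but each row's BYTEA column is still decoded into memory in one piece.
        let filename: String = row.get("filename");
        let data: Vec<u8> = row.get("data");
        println!("{}: {} bytes", filename, data.len());
    }
    Ok(())
}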

However, the protocol in COPY FROM STDIN/COPY TO STDOUT mode can handle incomplete streams, and is probably the best way to implement this: https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY

This has its own challenges though, which we've been discussing in #36
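
As a sketch of the caller-side shape a COPY-based stream could take (copy_in_raw, send and finish are assumed names used purely for illustration, not a confirmed sqlx API):

use sqlx::postgres::PgConnection;

// Hypothetical sketch: COPY ... FROM STDIN switches the connection into copy-in mode,
// after which the client pushes CopyData messages of arbitrary size.
async fn copy_in_chunks(
    conn: &mut PgConnection,
    mut next_chunk: impl FnMut() -> Option<Vec<u8>>,
) -> sqlx::Result<u64> {
    let mut copy = conn
        .copy_in_raw("COPY SAMPLE_BINARY_STORE (filename, data) FROM STDIN")
        .await?;

    while let Some(chunk) = next_chunk() {
        copy.send(chunk).await?; // each chunk becomes one or more CopyData messages
    }

    // CopyDone ends the stream; on error the client could send CopyFail instead and
    // keep the connection usable, unlike the regular query path described above.
    copy.finish().await
}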

@abonander
Collaborator

Even for COPY FROM STDIN/COPY TO STDOUT we still have the problem that, in the binary format, each field in a row needs a length prefix with the exact number of bytes it contains.

@mehcode
Member

mehcode commented May 31, 2020

Closing as this isn't directly possible.

@mehcode closed this as completed May 31, 2020