Conditional delete/update based on rows affected #2156

Closed
steve-chavez opened this issue Feb 7, 2022 · 15 comments
Labels
idea Needs of discussion to become an enhancement, not ready for implementation

Comments

@steve-chavez
Member

steve-chavez commented Feb 7, 2022

Problem

Even with safeupdate, a DELETE /employees?salary=gt.1000 can still delete a large number of rows.

We already have code for handling the special case of 1 affected row for update/delete, when using application/vnd.pgrst.object+json (CTSingularJSON):

failNotSingular :: ContentType -> Int64 -> Wai.Response -> DbHandler Wai.Response
failNotSingular contentType queryTotal response =
  if contentType == CTSingularJSON && queryTotal /= 1 then
    do
      lift SQL.condemn
      throwError $ Error.singularityError queryTotal
  else
    return response

We can extend this logic to accept an arbitrary number of affected rows and reject the request if it doesn't comply. With this a user can ensure that an expected number of rows will be updated/deleted.
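A minimal sketch of that generalization, following the shape of failNotSingular above (the function name, the expectedRows parameter and Error.expectedRowsError are hypothetical, nothing of this exists yet):

-- Sketch only: reject the response unless exactly the expected number of rows was affected.
failNotExpected :: Int64 -> Int64 -> Wai.Response -> DbHandler Wai.Response
failNotExpected expectedRows queryTotal response =
  if queryTotal /= expectedRows then
    do
      -- condemn the transaction so the mutation is rolled back, as failNotSingular does
      lift SQL.condemn
      throwError $ Error.expectedRowsError expectedRows queryTotal -- hypothetical error constructor
  else
    return response

A Prefer: rows-affected=1 would then behave like today's application/vnd.pgrst.object+json check, minus the object-shaped response.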

Proposal

Use a new Prefer header to specify the affected rows, a couple of options:

  1. Prefer: rows-affected=<number>

Specifies the expected number of affected rows in the header value. The current default behavior would be equivalent to Prefer: rows-affected=unlimited/infinity. A Prefer: rows-affected=1 would be equivalent to application/vnd.pgrst.object+json.

  2. Prefer: rows-affected=same-as-limit

This would reuse the ?limit=<number> query param value. The default behavior would also be "unlimited". To do the equivalent of application/vnd.pgrst.object+json, the request would be curl -X DELETE "host/tbl?limit=1" -H "Prefer: rows-affected=same-as-limit". We could also have a rows-affected=less-than-limit to use the limit as a threshold instead of an exact value, though maybe just a Prefer: rows-affected=limited with a behavior of "rows affected <= limit" would be enough.

With this option, to globally disallow unlimited deletes/updates, we could change the default Prefer to rows-affected=limited and combine it with max-rows-mutated=<number> (#2155). If users want to do a full-table delete, they'd need to be explicit and specify a Prefer: rows-affected=unlimited.

Notes

The Prefer RFC says:

A server that does not recognize or is unable to comply with
particular preference tokens in the Prefer header field of a request
MUST ignore those tokens and continue processing instead of signaling
an error.

Before, I understood that as "Prefer cannot reject a request", but handling=strict says:

The "handling=strict"
preference can be used to indicate that, while any particular error
may be recoverable, the client would prefer that the server reject
the request.

So I think the first paragraph above is mostly saying that we should ignore unrecognized/conflicting prefers; we can still reject requests with Prefer.

@steve-chavez steve-chavez added the idea Needs of discussion to become an enhancement, not ready for implementation label Feb 7, 2022
@wolfgangwalther
Member

Overall, this seems quite complex. Not necessarily because of the proposed solution, but because of the problem itself, I guess.

Maybe we can find a way to make this whole thing simpler?

I think we don't need too many options. The basic motivation should be: "When sending a request for mutation, I should be able to tell PostgREST what kind of mutation (number of rows) I roughly expect. If the result differs from that (heavily), then the request should be rejected because it's assumed to be faulty."

I think this is also the basic idea behind db-max-rows, only that this is set server-side.

From my perspective, we should:

  • Disallow limit and offset on mutation requests entirely. Those are not useful and likely to lead to confusion.
  • Fix db-max-rows, so that it works as expected for mutation requests.
  • Simplify the suggested Prefer header and name it something like Prefer: max-rows=xxx - similar to the server option. The upper limit for this value would be the server-side db-max-rows, but the user can prefer to choose a lower value for this specific request. This makes it easier to understand, because things behave the same way.

I'm not sure whether I like the idea of using a Prefer header here in general, will have to think about it. At first glance it kind of "conflicts" with the accept single object header - two different ways to express the same intent. Maybe we can find a more similar way to express both.

So I think the first paragraph above is mostly saying that we should ignore unrecognized/conflicting prefers; we can still reject requests with Prefer.

Agreed.

@steve-chavez
Member Author

steve-chavez commented Feb 7, 2022

Disallow limit and offset on mutation requests entirely. Those are not useful and likely to lead to confusion.

True, not that useful. I think they were implemented like that for consistency (there are tests too).

offset on mutations doesn't look useful at all, but limit could be useful in this type of scenario:

https://blog.crunchydata.com/blog/simulating-update-or-delete-with-limit-in-postgres-ctes-to-the-rescue

Basically for perf reasons: deleting a large number of rows in one go can lock a table for too long.
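For reference, the technique in that post boils down to selecting the ctids of a bounded batch of rows in a CTE and deleting only those. A minimal sketch, reusing the employees/salary example from above:

WITH batch AS (
  -- pick at most 1000 matching rows; ctid identifies each row's physical location
  SELECT ctid FROM employees
  WHERE salary > 1000
  LIMIT 1000
)
DELETE FROM employees
WHERE ctid IN (SELECT ctid FROM batch);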

At first glance it kind of "conflicts" with the accept single object header - two different ways to express the same intent.

We could deprecate vnd.pgrst.object+json in favor of the Prefer header; it's much clearer in its intent. Scratch that, I forgot that vnd.pgrst.object+json makes the result a JSON object (not a JSON array of 1).

@steve-chavez
Member Author

Maybe we can find a way to make this whole thing simpler?

The simplest thing could be to make ?limit on mutations behave differently (a breaking change): it would instead serve as a threshold for the affected rows, and the request would fail if surpassed.

@steve-chavez
Member Author

https://blog.crunchydata.com/blog/simulating-update-or-delete-with-limit-in-postgres-ctes-to-the-rescue

Or instead make limit behave like in the blog post: just limit the number of deleted/updated rows but succeed. A Prefer header could make the request fail instead.

@wolfgangwalther
Member

wolfgangwalther commented Feb 7, 2022

We could deprecate vnd.pgrst.object+json in favor of the Prefer header; it's much clearer in its intent. Scratch that, I forgot that vnd.pgrst.object+json makes the result a JSON object (not a JSON array of 1).

This reminds me of #1655. There we discussed how the accept header has two different things mixed together: the shape of the output and the "I need exactly 1 row" expectation.

Maybe we can decouple a few different things:

  • Disallow offset for mutating requests.
  • Re-implement limit for mutating requests as mentioned in the blog post. Just for batching - but no errors.
  • Use the Accept header strictly for the shape of the json response (this is basically what I asked for in Return 404 instead of 406 for json object response with no rows? #1655 - but without a replacement for the "exactly 1 row" aspect, yet)
  • Add some prefer headers that match db-max-rows closely, so that the user can lower their expected affected rows. This will throw an error if exceeded. Question: Should the counting for that depend on the Prefer count header, too? Is the threshold comparing against db-max-rows currently doing so?
  • Add something like a Prefer min-rows header additionally. If you wanted to do the same thing as before with the single accept header, you could do Prefer: min-rows=1,max-rows=1 - but with a lot more flexibility.

@wolfgangwalther
Member

wolfgangwalther commented Feb 8, 2022

Add some prefer headers that match db-max-rows closely, so that the user can lower their expected affected rows. This will throw an error if exceeded. Question: Should the counting for that depend on the Prefer count header, too? Is the threshold comparing against db-max-rows currently doing so?

Eh, I think I mis-remembered how db-max-rows works right now. db-max-rows is basically the server-side variant of ?limit right now, right?

So part of the suggestion would then be to change the way db-max-rows works for mutation requests, too. But that's discussed in #2155 I guess.

Edit: And that would still be inconsistent. To make it fully consistent even a GET request should error out when reaching db-max-rows. And that makes sense, too. After all, db-max-rows is about preventing malicious requests. Why would I want to return 1000 rows instead of an error, when I think the request is malicious?

@steve-chavez
Member Author

steve-chavez commented Feb 8, 2022

db-max-rows is basically the server-side variant of ?limit right now, right?

Yes, correct.

Disallow offset for mutating requests.
Re-implement limit for mutating requests as mentioned in the blog post. Just for batching - but no errors.

Cool, I agree.

To make it fully consistent even a GET request should error out when reaching db-max-rows. And that makes sense, too. After all, db-max-rows is about preventing malicious requests. Why would I want to return 1000 rows instead of an error, when I think the request is malicious?

Another benefit of db-max-rows is to prevent OOMs from happening (accidental requests on big tables). An error might be too harsh for reads, I think.

Add some prefer headers that match db-max-rows closely, so that the user can lower their expected affected rows. This will throw an error if exceeded.

How about reusing the specified ?limit here coupled with a Prefer: limit-handling=fail-if-surpassed (can be other name) to throw an error? That seems much simpler than duplicating the number in the header.

If you wanted to do the same thing as before with the single accept header, you could do Prefer: min-rows=1,max-rows=1 - but with a lot more flexibility.

Same could be done with ?limit=1 plus a Prefer: limit-handling=fail-if-not-equal as well.

It just doesn't seem right to add an additional header to specify a number when ?limit can already do the job.

In fact, we already have the Range header, which acts the same as limit; we could reuse that as well.

@wolfgangwalther
Member

Another benefit of db-max-rows is to prevent OOMs from happening (accidental requests on big tables). An error might be too harsh for reads, I think.

If a request is deemed as "accidental", it should not return success.

The problem with not returning an error if db-max-rows is hit: The user doesn't know about it. They might assume that they succeeded in getting all the data. Basically the same confusion as in #2155. I'm not even sure whether the RFCs allow responding with a partial response to a request without Range headers. Couldn't find anything right now in the RFCs.

How about reusing the specified ?limit here coupled with a Prefer: limit-handling=fail-if-surpassed (can be other name) to throw an error? That seems much simpler than duplicating the number in the header.

This is the part that I think makes it so complicated.

In fact, we already have the Range header, which acts the same as limit; we could reuse that as well.

And this is exactly the thing I think we should not do. The Range headers have quite clear semantics - and I don't think they make any sense for mutation requests. The RFC even says:

An origin server indicates response semantics by choosing an
appropriate status code depending on the result of processing the
POST request; almost all of the status codes defined by this
specification might be received in a response to POST (the exceptions
being 206 (Partial Content), 304 (Not Modified), and 416 (Range Not
Satisfiable)).

Clearly, POST is not meant to be used with range headers.

And I understand our limit parameter to be the querystring alternative to Range headers - but other than that, they should be pretty much identical. We can't deprecate it, because nested limits cannot be set via headers. But other than that, Range headers are the "right thing" to do instead, in an HTTP sense.

@steve-chavez
Member Author

If a request is deemed as "accidental", it should not return success.
The problem with not returning an error if db-max-rows is hit: The user doesn't know about it.

Yeah, I agree somewhat. However, until now max-rows has been useful in production instances for keeping OOMs at bay, so I wouldn't be keen on causing a breaking change there. I have another idea (basically running an EXPLAIN and rejecting results that are too wide) for an additional safeguard that would err instead of succeed; that could be evaluated as an alternative later.

The RFC even says:
almost all of the status codes defined by this specification might be received in a response to POST (the exceptions
being 206 (Partial Content) ... and 416 (Range Not Satisfiable)).

So that means a POST request can never return "206 Partial Content" and thus it shouldn't be limited with ?limit or Range... that seems reasonable. If we abide by that, then we shouldn't apply limit on POST and maybe not even on PATCH/DELETE... thus not implementing the batching limit mentioned above.

Prefer: min-rows=1,max-rows=1

So with that we're left with a new header. Hm, max-rows seems confusing because the config max-rows has a different behavior. Maybe a Prefer: max-affected/mutated?

@steve-chavez
Member Author

Maybe a Prefer: max-affected/mutated?

Maybe just rows-mutated; that would fit with the max-rows-mutated proposed in #2155 (whatever name we choose, it should match the global config).

@wolfgangwalther
Member

Yeah, I agree somewhat. However, until now max-rows has been useful in production instances for keeping OOMs at bay, so I wouldn't be keen on causing a breaking change there.

Hm. The change required from the client-side isn't much really - client requests would just need to add a ?limit=1000 to the querystring to get the same as before. This could even be done in a reverse proxy.

I think the current implementation of db-max-rows is just flawed and we should fix it, even at the cost of a breaking change.

Alternatively, we implement the same thing with a different name and deprecate db-max-rows - to be removed later. But keeping db-max-rows as flawed as it is long-term... I don't like that idea too much.

I have another idea (basically running an EXPLAIN and rejecting results that are too wide) for an additional safeguard that would err instead of succeed; that could be evaluated as an alternative later.

This reminds me of the discussion around aggregates and using query cost estimation to block queries, too. So there are for sure some things we can do to prevent malicious requests.

So that means a POST request can never return "206 Partial Content" and thus it shouldn't be limited with ?limit or Range... that seems reasonable. If we abide by that, then we shouldn't apply limit on POST and maybe not even on PATCH/DELETE... thus not implementing the batching limit mentioned above.

Agreed.

@steve-chavez
Member Author

steve-chavez commented Mar 16, 2022

Since #2164 has gotten a bit complicated, how about taking a different path?

Re-implement limit for mutating requests as mentioned in the blog post. Just for batching - but no errors.

The above behavior is useful (other RDBMSs have it, it's a common use case in pg, and it's something that users would expect), so I'll go ahead and implement it. This will also mitigate the problem of updating/deleting too much, which is really the main goal here.

The discussion above about the RFC regarding "206 Partial Content" mostly applies in the context of Range headers, but our limit is clearly a more advanced thing:

?limit is basically a range header on steroids, because it can be used for nested embeddings.

(From #2164 (comment))

@steve-chavez
Member Author

The limited update/delete technique from https://blog.crunchydata.com/blog/simulating-update-or-delete-with-limit-in-postgres-ctes-to-the-rescue has some limitations:

  1. The ctid system column is not available on views (unless explicitly included).
  2. Querying the ctid system column requires the SELECT privilege.

1 could be circumvented by inferring a PK or unique column from the view's source table; this would depend on the internal cache freshness.
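A sketch of what the generated query could look like for a view in that case, joining on the inferred PK instead of ctid (employees_view and its id column are hypothetical, and the PK column has to be exposed by the view):

WITH batch AS (
  -- limit the mutation to at most 100 matching rows, identified by the inferred PK
  SELECT id FROM employees_view
  WHERE salary > 1000
  LIMIT 100
)
UPDATE employees_view v
SET salary = 1000
FROM batch
WHERE v.id = batch.id;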

2 doesn't have a workaround unfortunately, so UPDATE + SELECT privileges will be a requirement.

@steve-chavez
Member Author

2 doesn't have a workaround unfortunately, so UPDATE + SELECT privileges will be a requirement.

Actually, it doesn't make much sense to have an UPDATE privilege without having a SELECT privilege on at least the PK - this is needed for updating particular rows; otherwise it would all be full-table UPDATEs. So this requirement seems fine.

@steve-chavez
Member Author

Continuing this on #2887; a conditional request looks like a better fit than a Prefer header.
