API endpoints for the series deletion API using block storage #4370


Open · wants to merge 28 commits into base: master

Conversation

@ilangofman (Contributor) commented Jul 19, 2021

What this PR does:
This PR implements the API endpoints for the time series deletion API for block storage. Based on the proposal for the series deletion API here.

The work done in this PR includes creating, getting and canceling deletion requests.

  • When a delete request is received, a tombstone is created to track the deletion information and state.
  • Getting the delete request information requires reading the tombstone files.
  • Cancelling a deletion is done by deleting the tombstone. The cancellation can only be done while the -purger.delete-request-cancel-period hasn't passed since the delete request was originally made.

For the API to be enabled, set --purger.enable to true.

The new endpoints are located within the purger, where the existing tenant deletion API is also located. However, the files live under the chunks subdirectory even though they contain APIs for blocks storage. A future PR will be required to move the purger service out of the chunks subdirectory.

To make it easier to review and get incremental feedback, the implementation of the proposal is going to be split across roughly four PRs:

  • API endpoints (this PR)
  • Permanent deletion processing
  • Query time filtering using tombstones before data is deleted
  • Cache invalidation

Please let me know if you have any questions or concerns. Any feedback is appreciated. Thank you for taking a look at this PR and helping out!

Which issue(s) this PR fixes:
Initial PR of multiple to address #4267

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: ilangofman <[email protected]>
@jeromeinsf (Contributor)
Can you please also review this PR @alolita @Aneurysm9

@bboreham (Contributor)

I'm a bit confused why files like pkg/chunk/purger/tenant_deletion_api.go disappeared, and why the blocks purger is under pkg/chunk.

Did any functionality disappear? Could the blocks functionality live under pkg/storage/tsdb ?

@ilangofman (Contributor, Author) commented Aug 11, 2021

I'm a bit confused why files disappeared like pkg/chunk/purger/tenant_deletion_api.go and the why the blocks purger is under pkg/chunk.

Did any functionality disappear?

@bboreham No functionality disappeared. The pkg/chunk/purger/tenant_deletion_api.go file was the blocks implementation of the tenant deletion endpoints. Since I created more endpoints for series deletion with blocks storage, I combined both under one file and called it the blocks purger. So all the previous code in tenant_deletion_api.go is now in pkg/chunk/purger/blocks_purger.go. I can undo that and create individual files for them if you would like.

Could the blocks functionality live under pkg/storage/tsdb ?

I agree, it is a bit confusing that some of the blocks functionality is under the chunks folder. I can try moving it to the suggested folder. If the answer to my previous question is to keep pkg/chunk/purger/tenant_deletion_api.go separate, I'm guessing it would make sense to move that file to this folder as well, since it is only for blocks storage. WDYT?

@bboreham (Contributor)

I just combined both under one file and called it the blocks purger

This would be a good thing to explain in the PR description, to aid reviewers.

Also you could organise the change into commits where one is a pure refactor, just moving code, and then subsequent ones modify the behaviour. The current set of 22 commits shows the journey that you went on, but is not a good breakdown for reviewing.

@bboreham (Contributor)

I expect when we remove chunks support (which is deprecated), then we can delete pkg/chunks, so don't add anything blocks-related under there.

@ilangofman (Contributor, Author)

Can you explain the distinction between "delete request" and "tombstone" as used in this code?

The way I have been thinking about it is that the tombstone is just a file that keeps track of a series deletion request. Getting the delete request information means reading the tombstone for it. Admittedly, I have been using the two terms synonymously, so I'll update the code to be more consistent. If you see any issues with that, please let me know.

Signed-off-by: ilangofman <[email protected]>
@jeromeinsf (Contributor)

@ilangofman Are you updating the code to be more consistent or should someone take over?

@ilangofman (Contributor, Author)

@ilangofman Are you updating the code to be more consistent or should someone take over?

Hi @jeromeinsf. I took a couple of weeks off, but I am going to get back to this. I think I should be able to finish this PR on my own.

For the rest of the deletion PRs, I am not sure how much time I will have available to work on them, with university starting this week. I would still like to contribute when I can, but it might take some time to make changes. To speed things up, if someone wants to take over the other PRs or collaborate, I'm open to it!

@alvinlin123 (Contributor) left a comment

Here is the first batch of review comments.

hash.Write(bufStart)
hash.Write(bufEnd)

sort.Strings(selectors)
Contributor:

Is hashing matchers the right way to create a "deletion ID"? For example, I can have matchers like match[]=process_start_time_seconds{job="prometheus"} and match[]=process_start_time_seconds{job= "prometheus"} (note the extra whitespace in the second query). They are logically the same matcher, but they will end up with different hash values.

Is it intended that deletion-request de-duplication is best effort?

Contributor (Author):
The reason will be displayed to describe this comment to others. Learn more.

I've updated it so it parses the matchers and then creates the hash based on the parsed matchers, so any logically equivalent deletion requests should now result in the same hash. The example you provided would produce the same hash. Also, split-up matchers like match[]=process_start_time_seconds&match[]={job= "prometheus"} would result in the same hash as your example (since a deletion request is the "AND" of all the matchers provided).
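
The canonicalize-then-hash idea could look roughly like this; it uses a toy Matcher type in place of the Prometheus matcher parser, so all names here are illustrative:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Matcher is a stand-in for a parsed label matcher; the real code would use
// the Prometheus parser, which already discards insignificant whitespace.
type Matcher struct {
	Name, Type, Value string
}

// hashMatchers builds a deletion request ID from the parsed, sorted matchers,
// so logically equivalent requests map to the same ID regardless of how the
// raw query string was formatted.
func hashMatchers(matchers []Matcher) string {
	canonical := make([]string, 0, len(matchers))
	for _, m := range matchers {
		// NUL separators avoid ambiguity between adjacent fields.
		canonical = append(canonical, m.Name+"\x00"+m.Type+"\x00"+m.Value)
	}
	sort.Strings(canonical)
	h := sha256.New()
	for _, s := range canonical {
		h.Write([]byte(s))
		h.Write([]byte{0xff}) // separator between matchers
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := []Matcher{{"__name__", "=", "process_start_time_seconds"}, {"job", "=", "prometheus"}}
	b := []Matcher{{"job", "=", "prometheus"}, {"__name__", "=", "process_start_time_seconds"}}
	fmt.Println(hashMatchers(a) == hashMatchers(b)) // prints "true": matcher order no longer matters
}
```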

// if the request was previously cancelled, we need to remove the cancelled tombstone before adding the pending one
if err := tManager.RemoveCancelledStateIfExists(ctx, requestID); err != nil {
level.Error(util_log.Logger).Log("msg", "removing cancelled tombstone state if it exists", "err", err)
http.Error(w, err.Error(), http.StatusInternalServerError)
Contributor:

We will need to be careful about just propagating errors back to the caller; we might inadvertently expose details that should not be exposed.

Contributor (Author):

I changed it to a message instead of propagating the error.

Comment on lines +142 to +146
userID, err := tenant.TenantID(ctx)
if err != nil {
http.Error(w, err.Error(), http.StatusUnauthorized)
return
}
Contributor:

This is pretty common code. I think existing Cortex code uses middleware.AuthenticateUser; is there any reason we can't use the same function for the delete series APIs?

http.Error(w, fmt.Sprintf("Error marshalling response: %v", err), http.StatusInternalServerError)
}

w.WriteHeader(http.StatusOK)
Contributor:

According to https://pkg.go.dev/net/http#ResponseWriter, since you called json.NewEncoder(w), which I assume calls ResponseWriter.Write(), a status code of 200 is automatically written for you together with a "guessed" content type. So you don't really need to set the status on this line.

However, I think we should set the content type and the status header ourselves for clarity.

Also, I think the if block on line 155 is missing a return statement, right? I would expect http.Error() to set the status to 500 already, so this w.WriteHeader(http.StatusOK) may have undefined behaviour.

Contributor (Author):

Makes sense, should be fixed now.

}

params := r.URL.Query()
requestID := params.Get("request_id")
Contributor:

Shouldn't we do input validation here? What if request_id is not given?

return
}

if deleteRequest.State == cortex_tsdb.StateProcessed {
Contributor:

Do we need a "processing" state? Or we can rely on the deleteRequestCancelPeriod to infer the "processing" state?

Contributor (Author):

Yes, we can use the deleteRequestCancelPeriod to infer the processing state. Whenever a tombstone with statePending is older than deleteRequestCancelPeriod, it is in the processing state.

return err
}

// If the newer state found comes later in the order, then we replace the state in the map.
Contributor:

In this case, I think it might be better to name the variables newerStateOrder and prevStateOrder; this does imply that renaming state to newerState would improve readability too :)

return false
}

func (s BlockDeleteRequestState) GetStateOrder() (int, error) {
Contributor:

I don't think it's wise to export this method. It looks a lot like an implementation detail that should not be exposed. I don't think we want anyone outside the tsdb package to depend on the fact that the "order" of StatePending is 0, StateProcessed is 2, etc., right? What matters is the relation between the orders, not the absolute values.

If we really want to expose something, consider a function like IsGreaterThan(a, b BlockDeleteRequestState), but I don't think we should in this case.

Contributor (Author):

Not sure why it was exported in the first place. I don't think the logic is required anywhere else. I made it private now.
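
An unexported version of the ordering, with a comparison helper along the lines the reviewer suggested, might look like this; the set of states and their order values here are illustrative:

```go
package main

import "fmt"

type BlockDeleteRequestState string

const (
	StatePending   BlockDeleteRequestState = "pending"
	StateCancelled BlockDeleteRequestState = "cancelled"
	StateProcessed BlockDeleteRequestState = "processed"
)

// stateOrder stays unexported: only the relative ordering matters, and no
// caller outside the package should depend on the absolute values.
func stateOrder(s BlockDeleteRequestState) int {
	switch s {
	case StatePending:
		return 0
	case StateCancelled:
		return 1
	case StateProcessed:
		return 2
	}
	return -1 // unknown state
}

// isLaterState is the kind of comparison helper that could be exported
// instead, if anything needs to be exposed at all.
func isLaterState(a, b BlockDeleteRequestState) bool {
	return stateOrder(a) > stateOrder(b)
}

func main() {
	fmt.Println(isLaterState(StateProcessed, StatePending)) // prints "true"
}
```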

}

if err = m.DeleteTombstoneFile(ctx, t.RequestID, t.State); err != nil {
level.Error(m.logger).Log("msg", "Created file with updated state but unable to delete previous state. Will retry next time tombstones are loaded", "err", err)
Contributor:

Would be nice to log the old and new state.


err := m.WriteTombstone(ctx, newT)
if err != nil {
level.Error(m.logger).Log("msg", "error creating tombstone file with the updated state", "err", err)
Contributor:

Would be nice to log the new state.


for _, state := range AllDeletionStates {
filename := getTombstoneFileName(requestID, state)
exists, err := m.TombstoneExists(ctx, requestID, state)
Contributor:

I wonder if we really need this extra check, which causes an extra network call (I believe m.TombstoneExists translates to an S3 HEAD HTTP call).

Would it be possible to just do m.ReadTombstone and differentiate if the returned err is NotFound or other errors, then handle them differently?

Contributor (Author):

Great catch! We can definitely do without it.


stale bot commented Feb 20, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 20, 2022
@ilangofman (Contributor, Author)

Sorry everyone, there has been no activity from me on this. I have been really busy with school and unfortunately haven't had time to work on this. I do plan to come back to this once I get a bit more time available, but if someone else wants to help out, that would be great too.

@stale stale bot removed the stale label Mar 2, 2022
@aughr mentioned this pull request Mar 21, 2022

stale bot commented Jun 11, 2022


@stale stale bot added the stale label Jun 11, 2022
@alvinlin123 (Contributor)

@ilangofman may start working on this again soon :) so not stale.

@stale stale bot removed the stale label Jun 11, 2022

stale bot commented Oct 16, 2022


@stale stale bot added the stale label Oct 16, 2022
@jeromeinsf (Contributor)

/remove stale

@stale stale bot closed this Nov 1, 2022
@alvinlin123 reopened this Nov 1, 2022
@alvinlin123 added the keepalive (Skipped by stale bot) label Nov 1, 2022
@stale stale bot removed the stale label Nov 1, 2022
@CharlieTLe (Member)

Hello @ilangofman, thank you for opening this PR.

There is a release in progress. As such, please rebase your CHANGELOG entry on top of the master branch and move the CHANGELOG entry to the top under ## master / unreleased.

Thanks,
Charlie

Labels: keepalive (Skipped by stale bot), size/XXL
Projects: None yet
6 participants