Skip to content

Conversation

@PenguinToast
Copy link
Contributor

@PenguinToast PenguinToast commented Jun 27, 2022

What changed?

Fix an issue where the history service would get stuck in a loop when reusing workflow ID's on top of ScyllaDB.

Why?

Closes #2683.

How did you test it?
First reproduced the issue by running Temporal on top of Scylla locally; then verified that the change fixed the issue. Ran Cassandra persistence tests to ensure that this change didn't break anything else.

Potential risks
As noted in the linked issue:

Not sure if there are conditions when Cassandra might also return nullable rows.

If there are cases where we should be handling nil rows as real errors, this change would break that.

Is hotfix candidate?
Yes.

@PenguinToast PenguinToast requested a review from a team as a code owner June 27, 2022 18:32
@CLAassistant
Copy link

CLAassistant commented Jun 27, 2022

CLA assistant check
All committers have signed the CLA.

@PenguinToast
Copy link
Contributor Author

There's some tests failing; looking into them

@PenguinToast
Copy link
Contributor Author

Fixed the tests; should be ready for review -- I'm not familiar with how to write idiomatic go; happy to make changes to this PR if needed.

cc @wxing1292 @yiminc

@brianlu-scale
Copy link

Oops, approved by accident, please ignore.

@PenguinToast PenguinToast requested a review from wxing1292 June 28, 2022 00:07
@wxing1292 wxing1292 requested a review from yiminc June 28, 2022 00:38
@alexshtin
Copy link
Contributor

I am confused about this PR. newConflictRecord() essentially returns map[string]**int. Then you expect cassandra client to write *int into type field, and then you dereference it to int and store in the same type field. So what is the type of type field is expected by caller?

@PenguinToast
Copy link
Contributor Author

I am confused about this PR. newConflictRecord() essentially returns map[string]**int. Then you expect cassandra client to write *int into type field, and then you dereference it to int and store in the same type field. So what is the type of type field is expected by caller?

We dereference and store to avoid changing all the downstream call sites that expect an int type -- happy to modify this PR to instead update those call sites to use an *int type instead, and update the relevant tests

Comment on lines 133 to 139
rowType, ok := conflictRecord["type"].(*int)
if !ok || rowType == nil {
return errors
}
// Dereference rowType for later use
conflictRecord["type"] = *rowType

Copy link
Contributor

@alexshtin alexshtin Jul 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally understood what you try to achieve here. I don't think that you need newConflictRecord func. It should work with default map[string]interface{} and here code should be:

Suggested change
rowType, ok := conflictRecord["type"].(*int)
if !ok || rowType == nil {
return errors
}
// Dereference rowType for later use
conflictRecord["type"] = *rowType
// ScyllaDB will return rows with null values to match # of queries in a batch query (see #2683).
// Field types will be pointer types (i.e. *int instead of int) in this case to support nil values.
// Check only "type" field for simplicity.
if rowType, ok := conflictRecord["type"].(*int); ok && rowType == nil {
return nil
}

and move this block before

var errors []error

line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't use newConflictRecord, then ok will be false when I try to get conflictRecord["type"].(*int) -- I'm not familiar enough with Go to know exactly why it works like that, but I'm assuming this is related to how the Cassandra client library works.

Copy link
Contributor

@alexshtin alexshtin Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, there is definitely some sort of reflection magic inside Cassandra library. Can you try:

fmt.Printf("%T\n", conflictRecord["type"])

w/o newConflictRecord. It should print underlying type of "type" key. What is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be also great if you add test that fails in ScyllaDB w/o your fix and pass with it. Then I will be able to test it myself.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, there is definitely some sort of reflection magic inside Cassandra library. Can you try:

fmt.Printf("%T\n", conflictRecord["type"])

w/o newConflictRecord. It should print underlying type of "type" key. What is it?

If I am not mistaken, this part does Unmarshalling.
https://github.com/gocql/gocql/blob/7a6cf00bbc98f4d7037e4a0fcca96fc946fd63d6/marshal.go#L205

We also used newConflictRecord way in our project to solve this issue, but @PenguinToast was faster :D
It is working on live env and seems to be no issues.

Copy link
Contributor

@alexshtin alexshtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test my suggestion with both DBs?

@PenguinToast
Copy link
Contributor Author

Can you test my suggestion with both DBs?

@alexshtin Responded to your suggestion -- without pre-populating a pointer type in the conflictRecord map, ok will be false if we try to read a pointer type out of the conflictRecord

@alexshtin alexshtin force-pushed the willsheu/fix-2683 branch from 24eab34 to e7978fe Compare July 15, 2022 07:03
@alexshtin
Copy link
Contributor

alexshtin commented Jul 15, 2022

I played with it today. Yes, gocql uses reflection magic. I didn't like that "type" value got dereferenced and saved under the same key in the map. I removed dereferencing part and change all other code to use "type" as *int. Tested (run persistence unit tests) myself with both Cassandra and ScyllaDB.

Thanks for contribution!

@alexshtin alexshtin merged commit ae5cad8 into temporalio:master Jul 15, 2022
@PenguinToast PenguinToast deleted the willsheu/fix-2683 branch July 15, 2022 17:56
@PenguinToast
Copy link
Contributor Author

I played with it today. Yes, gocql uses reflection magic. I didn't like that "type" value got dereferenced and saved under the same key in the map. I removed dereferencing part and change all other code to use "type" as *int. Tested (run persistence unit tests) myself with both Cassandra and ScyllaDB.

Thanks for contribution!

Agree that it's much cleaner with your change -- thanks for reviewing & fixing!

@nicklaros
Copy link

I wonder why it is removed from release/1.17.2, we use Scylla and encounter this issue so we need this?

v0id3r pushed a commit to art-technologies/temporal that referenced this pull request Aug 3, 2022
@alexshtin
Copy link
Contributor

Our patch release strategy is to patch regression or new features that were introduced in corresponding minor version. This issue was always there and we decided to include it in the next minor version release (1.18).

@brianlu-scale
Copy link

@alexshtin Is there any estimate for when 1.18 will come out?

PenguinToast added a commit to PenguinToast/temporal that referenced this pull request Aug 12, 2022
v0id3r pushed a commit to art-technologies/temporal that referenced this pull request Aug 18, 2022
v0id3r pushed a commit to art-technologies/temporal that referenced this pull request Aug 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Temporal fails to create workflow with the same ID as previous one when Scylla is used as persistence

7 participants