Skip to content

Realtime db triggers timeout after upgrading admin sdk to 9.6.0 #1231

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sluramod opened this issue Apr 20, 2021 · 17 comments · Fixed by #1234
Closed

Realtime db triggers timeout after upgrading admin sdk to 9.6.0 #1231

sluramod opened this issue Apr 20, 2021 · 17 comments · Fixed by #1234
Assignees

Comments

@sluramod
Copy link

sluramod commented Apr 20, 2021

[REQUIRED] Describe your environment

  • Operating System version: GCF
  • Firebase SDK version: 8.4.1
  • Firebase Product: database, functions
  • Node.js version: 10
  • NPM version: 7.7.6

[REQUIRED] Describe the problem

After upgrading to firebase-admin 9.6.0 realtime database triggers timeout periodically. Downgrade to 9.5.0 solved the issues.

Steps to reproduce:

Write a realtime db trigger. Use root reference to realtime db provided by the trigger (change.after.ref.root) to access data (read or write). Do not initialize firebase admin (this maybe irrelevant). The function will work correctly for a number of invocations. Eventually, one of the invocations will lead to the function timing out with whatever timeout value is configured for that specific function (60 seconds by default).

Relevant Code:

The code is proprietary, but even the simplest code that reads data from a different location than triggered will exhibit this behavior.

return change.after.ref.root
        .child(`some/db/path`)
        .once('value').then(() => return true) // this will never resolve or fail

I wish I could attach a screenshot of the functions Health tab showing spike in latencies after the update. Reverting to 9.5.0 eliminated timeouts.

@harshit-j
Copy link

We're facing this issue in both versions, 9.6.0 and 9.5.0. We just redeployed all functions with 9.5.0 to see if this helps. I'll post if we face this issue again.

@hiranya911
Copy link
Contributor

There was an incident that disrupted RTDB connectivity for some users on 04/16: https://status.firebase.google.com/incidents/REUgDGe1YiqLTQScxNGP

Some of you may have been affected by that (unless you can consistently reproduce the issue). If the problem persists you might want to enable RTDB debug logs and see what's causing the timeouts. Usually you can do so by adding the following statement to your code (somewhere at the initialization):

admin.database.enableLogging(console.log);

@harshit-j
Copy link

The 04/16 incident you mentioned above did affect us. But it only caused internal server errors from the Firebase Node.js Admin SDK. It did not make any promises wait forever to resolve.
Sure, I'll enable logging if the issue persists.

@sluramod
Copy link
Author

This was not isolated to a particular incident. We had this issue for awhile trying to figure out what's going on, until finally downgraded to 9.5.0 by fixing versions in package.json. No more timeouts after the downgrade. No other changes except 9.6.0 => 9.5.0

@hiranya911
Copy link
Contributor

hiranya911 commented Apr 20, 2021

@sluramod would be great if you can also provide a debug log. How long does it take (or how many invocations) before you'd observe a timeout?

Also have you gone back to v9.6.0 and confirmed the problem occurs there again?

@sluramod
Copy link
Author

@hiranya911 I cannot identify a pattern how long or how many invocations it takes for the issue to present itself; if I had this info I would share it in the original bug description. We are not going back to 9.6.0 to test if the problem occurs there - our production environment is not for qa purposes.

If you want to reach out personally - I can share the details observed, I cannot do it in the context of this discussion.

@sluramod
Copy link
Author

@hiranya911 Perhaps we are running into this from #1194

RELEASE NOTE: The periodic token refresher background task has been decoupled from the SDK core, and moved into the RTDB module. This task no longer starts automatically, unless the admin.database() API is explicitly invoked.

The thing is, we don't want to initialize admin to improve cold startup time for our functions.

@hiranya911
Copy link
Contributor

firebase-functions uses firebase-admin under the hood. So it's likely still getting initialized.

So far I've been able to repro a couple of timeouts. These seem to have been caused due to the metadata server sending auth tokens with a very short expiry time. This behavior of the metadata server, coupled with some changes made in the above PR can potentially cause tight retry loops. I'm not sure if that's the exact issue, but it's something to look at.

Initial auth handshake:

2:46:18.567 PM
requestCounter
Refreshing token; forceRefresh: false
2:46:18.600 PM
requestCounter
Cached token: undefined
2:46:18.600 PM
requestCounter
New token: {"accessToken":"ya29.c.KqkB-wcimdtPRc3oO3n3UIzjn51roI5obKLD1hYly0j9Guec0LoduiBHIhbetwNXKxeyNz-MoEXLYnB9HROXWuAyc2IOh8cSFdyc6RPnpBFju-5suBcwsIwhEVfVVL84pkOoFbZpOcXSvMJ3psX3dIsaRxM2sO6DWuvfY7BuFeSvGqIWfjWmiePhkeSt-oJ2kyO6zaAcHUwTEc72uK_8exvEkYfiDmLS5OWxZA","expirationTime":1618956977595} 
2:46:18.600 PM
requestCounter
Scheduling proactive refresh for 1498998 

Proactive refresh after 25 minutes (original token has 5 more minutes to expire):

3:11:17.750 PM
requestCounter
Refreshing token; forceRefresh: true
3:11:19.350 PM
requestCounter
Cached token: {"accessToken":"ya29.c.KqkB-wcimdtPRc3oO3n3UIzjn51roI5obKLD1hYly0j9Guec0LoduiBHIhbetwNXKxeyNz-MoEXLYnB9HROXWuAyc2IOh8cSFdyc6RPnpBFju-5suBcwsIwhEVfVVL84pkOoFbZpOcXSvMJ3psX3dIsaRxM2sO6DWuvfY7BuFeSvGqIWfjWmiePhkeSt-oJ2kyO6zaAcHUwTEc72uK_8exvEkYfiDmLS5OWxZA","expirationTime":1618956977595} 
3:11:19.749 PM
requestCounter
New token: {"accessToken":"ya29.c.KqkB-wcimdtPRc3oO3n3UIzjn51roI5obKLD1hYly0j9Guec0LoduiBHIhbetwNXKxeyNz-MoEXLYnB9HROXWuAyc2IOh8cSFdyc6RPnpBFju-5suBcwsIwhEVfVVL84pkOoFbZpOcXSvMJ3psX3dIsaRxM2sO6DWuvfY7BuFeSvGqIWfjWmiePhkeSt-oJ2kyO6zaAcHUwTEc72uK_8exvEkYfiDmLS5OWxZA","expirationTime":1618956977350} 

Here the metadata server has sent the same token as before, with less than 5 minutes on it. SDK now enters a tight retry loop, degrading performance, which in some cases can end with a timeout.

@bolds07
Copy link

bolds07 commented Apr 21, 2021

I'm having a similar issue but I believe mine comes from firebase auth [although the possibilite of the realtime database being the cause is not dismissed]. I've filled my own issue #1233 [before finding this one]

in my analisys is very consistent the 30 minutes interval between timeouts

The thing is, we don't want to initialize admin to improve cold startup time for our functions.

How big it the problem of calling admin.database() if 100% of the requests will eventually call it?

@hiranya911
Copy link
Contributor

I believe I have isolated the issue with RTDB calls periodically timing out. This happens when the metadata server sends a token that have a TTL shorter than 5 minutes (which seems to happen every 30 minutes or so in Cloud Functions). Here's an example sequence of events leading to the problem:

  • 2:46:18.567 PM: SDK initializes database (admin.database()) and registers the token refresh event to 25 minutes (TTL - 5 minutes)
  • 3:11:19.350 PM: Token refresh event fires, but the metadata server sends back the same token now with a TTL less than 5 minutes

The 2nd step above triggers the following event handler:

private onTokenChange(_: string): void {
this.appInternal.INTERNAL.getToken()
.then((token) => {

That in turns calls the following logic:

public getToken(forceRefresh = false): Promise<FirebaseAccessToken> {
if (forceRefresh || this.shouldRefresh()) {
return this.refreshToken();
}

But since the last fetched token has a short TTL, it causes the SDK to attempt a token refresh again. This results a in a tight loop, with separate RPCs for each refresh attempt, and can result in a timeout in some cases. The loop continues until the metadata server eventually sends a fresh token with a TTL longer than 5 minutes.

I have a fix in progress.

@bolds07
Copy link

bolds07 commented Apr 21, 2021

@hiranya911,

I believe this issue also happens in other firebase SDK.
Since feb/march I'm facing a situation that some android users at some sessions never get answer from database queries

   FirebaseDatabase.getInstance().getReference("any/node").addListenerForSingleValueEvent(new ValueEventListener() {
                @Override
                public void onDataChange(@NonNull DataSnapshot snapshot) {
                    //not called                    
                }

                @Override
                public void onCancelled(@NonNull DatabaseError error) {
                    //neither
                }
            });

During the past month I looked in many possible ways for the error always assuming the problem was with the firebase rules related to the queries... but since you confirmed there is really an issue with the tokenization that would easily explain this problem i face on the android sdk

I'm using the firebase-database: 19.7.0 for android

could you forward that to the responsible team?

@hiranya911
Copy link
Contributor

but since you confirmed there is really an issue with the tokenization that would easily explain this problem i face on the android sdk

I'm afraid it doesn't. Server-side Firebase SDKs are quite different from client SDKs, and the problem I've outlined above is extremely specific to server-side code running on Google Cloud Functions. It cannot explain any client-side problems.

I'm using the firebase-database: 19.7.0 for android

Please file a bug report at https://github.com/firebase/firebase-android-sdk

@bolds07
Copy link

bolds07 commented Apr 21, 2021

@hiranya911 I believe that this specific feature is quite similar...

The client SDK also has a token that refresh from time to time.... As you said the whole problem starts because the server sends a token with ttl less than 5 minutes....

If both client and admin sdk are connected to the same token server it is very likely to the problem be the same....

the difference is that in a firebase function this will cause a timeout and in an android app this error will cause the app to take forever to load something making the user bored [close the app] or try to use the app without load [possibly causing other errors]

For sure my app has dozens of database requests but the one that signs to me a red flag are the ones related to login....

Think with me... if someone tries to use an app and the login page takes over 5 minutes this person will asume imediatally there is a bug... lots of complaims appear on google play...

I will open an issue in the firebase android sdk and tag you, if you can please wave to the engineers over there it would help a lot, because many times when we open an issue here it bounces a lot until the engineers assume there is a problem in the firebasse side (since the first assumption is always that the problem is from our code)

@hiranya911
Copy link
Contributor

hiranya911 commented Apr 21, 2021

If both client and admin sdk are connected to the same token server it is very likely to the problem be the same....

They are not. Admin SDKs use Google OAuth, whereas client SDKs use Firebase access tokens. These are entirely different services, and even different token standards.

Generally speaking, a Firebase access token is always valid for 1 hour. Google OAuth tokens are also usually valid for 1 hour, unless they were minted by the metadata server on Google Cloud Functions -- in which case it can have a shorter TTL.

@bolds07
Copy link

bolds07 commented Apr 22, 2021

@hiranya911
just out of curiosity, are you going to solve it in the SDK code or in the token server code?

i've downgraded to 9.5.0 and i'm still facing the issue

2021-04-22 11:13:36.702 ARTwarsawknh2vg4esm9n Function execution took 40003 ms, finished with status: 'timeout'

2021-04-22 11:39:32.446 ARTwarsawx8tmqmc7iwy5 Function execution took 40003 ms, finished with status: 'timeout'

2021-04-22 12:05:47.911 ARTwarsawn84duszdtxxk Function execution took 40026 ms, finished with status: 'timeout'

2021-04-22 12:31:46.712 ARTwarsaw79fd6hyb78ir Function execution took 40077 ms, finished with status: 'timeout'

2021-04-22 12:58:49.611 ARTwarsawbadgvv5xmrdy Function execution took 40100 ms, finished with status: 'timeout'

2021-04-22 13:25:43.130 ARTwarsaw5t7jtsqndcbx Function execution took 40005 ms, finished with status: 'timeout'

2021-04-22 13:51:52.362 ARTwarsawdlnryq9velya Function execution took 40004 ms, finished with status: 'timeout'

2021-04-22 14:18:06.714 ARTwarsawsk8xp9jgdth8 Function execution took 40101 ms, finished with status: 'timeout'

2021-04-22 14:45:13.411 ARTwarsawaocmksyfuil9 Function execution took 40102 ms, finished with status: 'timeout'

@hiranya911
Copy link
Contributor

hiranya911 commented Apr 22, 2021

9.5.0 shouldn't be affected by the same issue. As @sluramod has reported above (and as also confirmed in our own testing), the problem only affects 9.6.0.

The fix #1234 is for the SDK code. We have no control over token server implementation.

PS: Btw you timestamps above are all ~25 minutes apart. That's exactly what I'd expect to see when affected by this bug. I think your function is somehow still deployed with v9.6.0.

@bolds07
Copy link

bolds07 commented Apr 22, 2021

I'm using 9.4.2 [because was the last from last year and i'm sure my problems started this year]

if older versions are not affected by this issue it means my original issue #1233 shall be another problem

if you can have a look please, it is basically same premisse but instead of firebase query is authtoken creation

auth.createCustomToken(uid).then(function(customToken) {

       res.status(200).json({
           t: customToken,
           n: newUser,
           r: regDate
       });
   }).catch(function(error) {
       console.error("Error creating custom token: " + JSON.stringify(error));
       res.status(200).json({
           e: error
       });
   });

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants