
feat: Add application latencies to readRows calls #1609


Open
wants to merge 456 commits into main

Conversation

@danieljbruce (Contributor) commented May 29, 2025

Description

This PR adds application latency collection to readRows calls.

Impact

Now we can explore application latencies in customer code to investigate where work needs to be done in our clients.

Testing

Added a test that loops through a stream and measures application latencies.
Added another test that handles events on a stream and measures application latencies.
Moved some test utilities around.

Additional Information

This PR builds on top of #1571.

The idea here is that every time we read a row from the stream, the transform and then read functions get called to pass down a new row to take the place of the one the user just read. This means we can take a time snapshot when each row passes through the transform function and use it to record application latencies. In our experiment in 359913994-third-PR-CSM-with-application-latencies-experiment, where the actual application latency is 5s, this turns out to be a pretty good approximation.

[image: experiment results]
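Below is a minimal sketch of the transform-based snapshot described above, not the actual implementation: RowLatencyTransform and the recordApplicationLatency callback are hypothetical names, and the real change wires the timing into the existing readRows transform.

import {Transform, TransformCallback} from 'stream';

class RowLatencyTransform extends Transform {
  private lastRowReceivedTime?: bigint;

  constructor(private recordApplicationLatency: (latencyMs: number) => void) {
    super({objectMode: true});
  }

  _transform(row: unknown, _encoding: BufferEncoding, callback: TransformCallback): void {
    const currentTime = process.hrtime.bigint();
    if (this.lastRowReceivedTime) {
      // Time since the previous row passed through the transform, in milliseconds.
      // This approximates how long the application spent processing that row.
      const applicationLatency = Number(
        (currentTime - this.lastRowReceivedTime) / BigInt(1000000),
      );
      this.recordApplicationLatency(applicationLatency);
    }
    this.lastRowReceivedTime = currentTime;
    callback(null, row);
  }
}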

The more complex case is when the blocking application awaits events that have been added to the end of the event loop while the event loop already contains events we don't want to measure, such as processing a gRPC error. In that case I'm not sure there is any way to measure application latencies while excluding the other events on the event loop.

Alternatives

We could provide an even more exact measurement by exposing an API surface that lets the user start and stop the timer themselves. For instance, if users could provide a metrics handler with startApplicationLatencyTimer and stopApplicationLatencyTimer methods and call them from their code, they would get a more exact measurement. In our experiment, though, this only accounts for a few milliseconds, so it is probably not worth the extra API surface and development-time investment.
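As a rough sketch of that alternative (the two timer method names come from the paragraph above; the handler interface and processRows/handleRow are hypothetical), the user would bracket their own per-row work explicitly:

interface ApplicationLatencyMetricsHandler {
  startApplicationLatencyTimer(): void;
  stopApplicationLatencyTimer(): void;
}

async function processRows(
  stream: AsyncIterable<unknown>,
  handler: ApplicationLatencyMetricsHandler,
): Promise<void> {
  for await (const row of stream) {
    handler.startApplicationLatencyTimer();
    handleRow(row); // the user's application code for this row
    handler.stopApplicationLatencyTimer();
  }
}

function handleRow(row: unknown): void {
  // Placeholder for the user's per-row processing.
}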

product-auto-label bot added the labels size: l (Pull request size is large) and api: bigtable (Issues related to the googleapis/nodejs-bigtable API) on May 29, 2025
},
projectId,
retryCount: 0,
});
danieljbruce (Contributor, Author) commented:

The above code has been moved out into a function and the check has been abstracted so that it can vary for each test. This is because the check is different for application latencies.
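For illustration only, the shape of that refactor might look like the following; MetricsCheck and readRowsWithCheck are hypothetical names, not the helpers in this PR:

type MetricsCheck = (collectedMetrics: unknown[]) => void;

async function readRowsWithCheck(check: MetricsCheck): Promise<void> {
  const collectedMetrics: unknown[] = [];
  // ... run the readRows call and collect the metrics each handler receives ...
  // The assertion is injected, so the application-latency test can supply a
  // different check than the other metrics tests.
  check(collectedMetrics);
}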

danieljbruce requested a review from mutianf on May 29, 2025 17:32
danieljbruce marked this pull request as ready for review on May 29, 2025 17:32
danieljbruce requested review from a team as code owners on May 29, 2025 17:32
if (this.lastRowReceivedTime) {
  // application latency is measured in total milliseconds.
  const applicationLatency = Number(
    (currentTime - this.lastRowReceivedTime) / BigInt(1000000),
  );
A reviewer (Contributor) commented:

I don't think this is right? How do we make sure currentTime - lastRowReceivedTime doesn't include server response time? (bigtable client waiting for server to return a response)

danieljbruce (Contributor, Author) replied:

For application latencies it is going to be impossible to get an exact measurement because there are just too many different combinations of what could happen on the Node event loop.

The idea here is that the time between row reads in the user's application should cover all the code that runs in the for loop. If there is an await operation inside the for loop, some other events on the event loop might also get processed, and those arguably shouldn't be included. I used recordingApplicationLatencies so that recording stops if, for instance, a retryable error occurs, which removes time caused by a call delay.
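As a rough sketch of that behaviour (recordingApplicationLatencies and onRowReachesUser are names from this PR; the signatures and the recordApplicationLatency sink are illustrative):

let recordingApplicationLatencies = true;

function onRetryableError(): void {
  // Don't attribute the retry/backoff delay to the user's application code.
  recordingApplicationLatencies = false;
}

function onRowReachesUser(elapsedMs: number): void {
  if (recordingApplicationLatencies) {
    recordApplicationLatency(elapsedMs); // hypothetical metrics sink
  }
  // Resume recording once rows are flowing to the user again.
  recordingApplicationLatencies = true;
}

function recordApplicationLatency(latencyMs: number): void {
  console.log(`application latency: ${latencyMs}ms`);
}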

Just to be sure, this is what we want to measure right?

// Application latencies is difference between start time and end time in for loop
const stream = table.createReadStream();
for await (const row of stream) {
  // Record start time
  // Run application code
  // Record end time
}

I think all the different combinations we need to capture might need a document, because getting a perfect application latency measurement may be difficult.

@bhshkh commented Jun 2, 2025:

Yes, your understanding of the desired metric is correct. Ideally, we want to measure the execution time of your application code within each iteration of the for await (const row of stream) loop, as your snippet illustrates:

// Record start time <-- Just before your code for a row
// Run application code
// Record end time <-- Just after your code for a row
This means isolating the time spent actively processing a row in the application, excluding:

  1. Time spent by the library waiting for the server to send the next row.
  2. Server-side processing time to prepare the next row.
  3. Network latency for the next row.
  4. Time spent on other tasks in the Node.js event loop if the library awaits unrelated operations.

The current implementation in onRowReachesUser, which calculates currentTime - lastRowReceivedTime, measures the duration from when the previous row was made available to the library to when the current row is made available. This interval inherently includes the server response time and network latency for fetching the current row, not just the library's processing time for the previous row.

While recordingApplicationLatencies helps by pausing metric collection during retries (which is good for excluding delays from retryable errors), it doesn't address the server/network time embedded in the currentTime - lastRowReceivedTime calculation during normal streaming.

Achieving a perfect measurement of only the application code's execution time is indeed challenging, as you pointed out, especially with the complexities of the event loop. The current metric is more of a "time between the library seeing row N and the library seeing row N+1", which includes the processing of row N and the time to get row N+1. Since the goal is strictly to calculate the application's per-row processing time, the current approach doesn't fully capture that.

danieljbruce (Contributor, Author) replied:

> The current metric is more of a "time between the library seeing row N and the library seeing row N+1"

I'm afraid there is a bit more to it than that. The Node library is pretty complex, but from what I remember, the user stream stores one row ready for the user to read and queues the rest in its writable buffer. When the for loop is reached, that one row is read and then the next row passes through the write and transform functions so it is in the ready position for the next read. My code collects timestamps from the transform function and calculates application latencies from them. Let me give an example of a common user-code situation:

const stream = table.createReadStream({limit: 7}); // Requests rows [0, 1, 2, 3, 4, 5, 6]
// Some code runs, rows [0, 1, 2, 3] have been fetched
// row 0 is ready to read, [1, 2, 3] are in the writable buffer of the user stream
for await (const row of stream) {
  // When row i is read, row i+1 passes through transform function so it is ready
  // Run user application code that processes the row
  // If this code is synchronous then no other events will process on the event loop
}
// Stream is empty, it has processed rows 0, 1, 2 and 3

> The current implementation in onRowReachesUser, which calculates currentTime - lastRowReceivedTime, measures the duration from when the previous row was made available to the library to when the current row is made available

This is not true. The loop processes all rows that are currently available and then the program starts executing the code after the loop.

danieljbruce (Contributor, Author) added:

i.e., the loop isn't waiting for more data to arrive.

danieljbruce (Contributor, Author) added:

We chatted about this more offline, and it looks like for user code that uses stream handlers we need to know when the client is waiting for a server response. The implementation needs to be a little more complex, and we need to figure out how to track server response time.

Base automatically changed from 359913994-third-PR-CSM to main June 25, 2025 20:32