
feat: Add application latencies to readRows calls #1609


Open
wants to merge 456 commits into main

Conversation

@danieljbruce (Contributor) commented May 29, 2025

Description

This PR adds application latency collection to readRows calls.

Impact

Now we can explore application latencies in customer code to investigate where work needs to be done in our clients.

Testing

Added a test that loops through a stream and measures application latencies.
Added another test that handles events on a stream and measures application latencies.
Moved some test utilities around.

Additional Information

This PR builds on top of #1571.

The idea here is that every time we read a row from the stream, the transform and then read functions get called to pass down a new row to take the place of the one the user just read. This means we can take a time snapshot when each row passes through the transform function and use it to record application latencies. In our experiment in 359913994-third-PR-CSM-with-application-latencies-experiment, where the actual application latency is 5s, this turns out to be a pretty good approximation.

[image: experiment results]
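Below is a minimal sketch of the transform-based snapshot described above, not the actual implementation: RowLatencyTransform and the recordApplicationLatency callback are hypothetical names, and the real change wires the timing into the existing readRows transform.

import {Transform, TransformCallback} from 'stream';

class RowLatencyTransform extends Transform {
  private lastRowReceivedTime?: bigint;

  constructor(private recordApplicationLatency: (latencyMs: number) => void) {
    super({objectMode: true});
  }

  _transform(row: unknown, _encoding: BufferEncoding, callback: TransformCallback): void {
    const currentTime = process.hrtime.bigint();
    if (this.lastRowReceivedTime) {
      // Time since the previous row passed through the transform, in milliseconds.
      // This approximates how long the application spent processing that row.
      const applicationLatency = Number(
        (currentTime - this.lastRowReceivedTime) / BigInt(1000000),
      );
      this.recordApplicationLatency(applicationLatency);
    }
    this.lastRowReceivedTime = currentTime;
    callback(null, row);
  }
}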

The more complex case is when the blocking application awaits events that have been added to the end of the event loop while the event loop already contains events we don't want to measure, such as processing a gRPC error. In that case I'm not sure there is any way to measure application latencies while excluding the other events on the event loop.

Alternatives

We could provide an even more exact measurement by exposing an API surface that lets the user start and stop the timer themselves. For instance, if users could provide a metrics handler with startApplicationLatencyTimer and stopApplicationLatencyTimer methods and call them from their code, they would get a more exact measurement. In our experiment, though, this only accounts for a few milliseconds, so it is probably not worth the extra API surface and development-time investment.
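As a rough sketch of that alternative (the two timer method names come from the paragraph above; the handler interface and processRows/handleRow are hypothetical), the user would bracket their own per-row work explicitly:

interface ApplicationLatencyMetricsHandler {
  startApplicationLatencyTimer(): void;
  stopApplicationLatencyTimer(): void;
}

async function processRows(
  stream: AsyncIterable<unknown>,
  handler: ApplicationLatencyMetricsHandler,
): Promise<void> {
  for await (const row of stream) {
    handler.startApplicationLatencyTimer();
    handleRow(row); // the user's application code for this row
    handler.stopApplicationLatencyTimer();
  }
}

function handleRow(row: unknown): void {
  // Placeholder for the user's per-row processing.
}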

product-auto-label bot added the labels size: l (Pull request size is large) and api: bigtable (Issues related to the googleapis/nodejs-bigtable API) on May 29, 2025
},
projectId,
retryCount: 0,
});
danieljbruce (Contributor, Author) commented:

The above code has been moved out into a function and the check has been abstracted so that it can vary for each test. This is because the check is different for application latencies.
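For illustration only, the shape of that refactor might look like the following; MetricsCheck and readRowsWithCheck are hypothetical names, not the helpers in this PR:

type MetricsCheck = (collectedMetrics: unknown[]) => void;

async function readRowsWithCheck(check: MetricsCheck): Promise<void> {
  const collectedMetrics: unknown[] = [];
  // ... run the readRows call and collect the metrics each handler receives ...
  // The assertion is injected, so the application-latency test can supply a
  // different check than the other metrics tests.
  check(collectedMetrics);
}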

danieljbruce requested a review from mutianf on May 29, 2025 17:32
danieljbruce marked this pull request as ready for review on May 29, 2025 17:32
danieljbruce requested review from a team as code owners on May 29, 2025 17:32
if (this.lastRowReceivedTime) {
  // application latency is measured in total milliseconds.
  const applicationLatency = Number(
    (currentTime - this.lastRowReceivedTime) / BigInt(1000000),
  );
A reviewer (Contributor) commented:

I don't think this is right? How do we make sure currentTime - lastRowReceivedTime doesn't include server response time? (bigtable client waiting for server to return a response)

danieljbruce (Contributor, Author) replied:

For application latencies it is going to be impossible to get an exact measurement because there are just too many different combinations of what could happen on the Node event loop.

The idea here is that the time between row reads in the user's application should cover all the code that runs in the for loop. If there is an await operation inside the for loop, some other events on the event loop might also get processed, and those arguably shouldn't be included. I used recordingApplicationLatencies so that recording stops if, for instance, a retryable error occurs, which removes time caused by a call delay.
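As a rough sketch of that behaviour (recordingApplicationLatencies and onRowReachesUser are names from this PR; the signatures and the recordApplicationLatency sink are illustrative):

let recordingApplicationLatencies = true;

function onRetryableError(): void {
  // Don't attribute the retry/backoff delay to the user's application code.
  recordingApplicationLatencies = false;
}

function onRowReachesUser(elapsedMs: number): void {
  if (recordingApplicationLatencies) {
    recordApplicationLatency(elapsedMs); // hypothetical metrics sink
  }
  // Resume recording once rows are flowing to the user again.
  recordingApplicationLatencies = true;
}

function recordApplicationLatency(latencyMs: number): void {
  console.log(`application latency: ${latencyMs}ms`);
}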

Just to be sure, this is what we want to measure right?

// Application latencies is difference between start time and end time in for loop
const stream = table.createReadStream();
for await (const row of stream) {
  // Record start time
  // Run application code
  // Record end time
}

I think all the different combinations we need to capture might need a document, because getting a perfect application latency measurement may be difficult.

@bhshkh commented Jun 2, 2025:

Yes, your understanding of the desired metric is correct. Ideally, we want to measure the execution time of your application code within each iteration of the for await (const row of stream) loop, as your snippet illustrates:

// Record start time <-- Just before your code for a row
// Run application code
// Record end time <-- Just after your code for a row
This means isolating the time spent actively processing a row in the application, excluding:

  1. Time spent by the library waiting for the server to send the next row.
  2. Server-side processing time to prepare the next row.
  3. Network latency for the next row.
  4. Time spent on other tasks in the Node.js event loop if the library awaits unrelated operations.

The current implementation in onRowReachesUser, which calculates currentTime - lastRowReceivedTime, measures the duration from when the previous row was made available to the library to when the current row is made available. This interval inherently includes the server response time and network latency for fetching the current row, not just the library's processing time for the previous row.

While recordingApplicationLatencies helps by pausing metric collection during retries (which is good for excluding delays from retryable errors), it doesn't address the server/network time embedded in the currentTime - lastRowReceivedTime calculation during normal streaming.

Achieving a perfect measurement of only the application code's execution time is indeed challenging, as you pointed out, especially with the complexities of the event loop. The current metric is more of a "time between the library seeing row N and the library seeing row N+1", which includes the processing of row N and the time to get row N+1. Since the goal is strictly to calculate the application's per-row processing time, the current approach doesn't fully capture that.

danieljbruce (Contributor, Author) replied:

> The current metric is more of a "time between the library seeing row N and the library seeing row N+1"

I'm afraid there is a bit more to it than that. The Node library is pretty complex, but from what I remember, the user stream stores one row ready for the user to read and queues the rest in its writable buffer. When the for loop is reached, that one row is read and then the next row passes through the write and transform functions so it is in the ready position for the next read. My code collects timestamps from the transform function and calculates application latencies from them. Let me give an example of a common user-code situation:

const stream = table.createReadStream({limit: 7}); // Requests rows [0, 1, 2, 3, 4, 5, 6]
// Some code runs, rows [0, 1, 2, 3] have been fetched
// row 0 is ready to read, [1, 2, 3] are in the writable buffer of the user stream
for await (const row of stream) {
  // When row i is read, row i+1 passes through transform function so it is ready
  // Run user application code that processes the row
  // If this code is synchronous then no other events will process on the event loop
}
// Stream is empty, it has processed rows 0, 1, 2 and 3

> The current implementation in onRowReachesUser, which calculates currentTime - lastRowReceivedTime, measures the duration from when the previous row was made available to the library to when the current row is made available

This is not true. The loop processes all rows that are currently available and then the program starts executing the code after the loop.

danieljbruce (Contributor, Author) added:

i.e., the loop isn't waiting for more data to arrive.

danieljbruce (Contributor, Author) added:

We chatted about this more offline, and it looks like for user code that uses stream handlers we need to know when the client is waiting for a server response. The implementation needs to be a little more complex, and we need to figure out how to track server response time.

Base automatically changed from 359913994-third-PR-CSM to main June 25, 2025 20:32