What would you like to be added:
Currently, the conversation replay datagen's max-model-len truncation logic tokenizes each request a second time and truncates specific portions to check whether max_model_len is exceeded. This can lead to higher CPU utilization when the context is large. Instead, avoid retokenization as much as possible and simplify the logic so that it is both easier to understand and more efficient.
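One possible direction, as a rough sketch only: tokenize each conversation turn once, cache the token IDs, and enforce the limit by slicing the cached IDs instead of re-encoding text. The names here (`Turn`, `tokenize_once`, `fit_to_budget`, `max_output_tokens`) are hypothetical and do not reflect the existing datagen code. The sketch also assumes that per-turn token counts are a good enough approximation of the full prompt length; with a real tokenizer, merges across turn boundaries and chat-template tokens can shift the count slightly, so a small safety margin may be needed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    """One conversation turn with its token IDs cached after the first encode."""
    text: str
    token_ids: List[int] = field(default_factory=list)


def tokenize_once(turns: List[Turn], encode: Callable[[str], List[int]]) -> None:
    """Encode each turn at most once; turns that already have IDs are skipped."""
    for turn in turns:
        if not turn.token_ids:
            turn.token_ids = encode(turn.text)


def fit_to_budget(
    turns: List[Turn], max_model_len: int, max_output_tokens: int
) -> List[List[int]]:
    """Keep the most recent turns that fit in the input budget.

    Walks the turns from newest to oldest, slicing cached token IDs.
    The oldest kept turn is truncated from the left if it only partly fits.
    No text is re-encoded here.
    """
    budget = max_model_len - max_output_tokens
    kept: List[List[int]] = []
    for turn in reversed(turns):
        if budget <= 0:
            break
        ids = turn.token_ids[-budget:]  # whole turn if it fits, else its tail
        kept.append(ids)
        budget -= len(ids)
    kept.reverse()  # restore chronological order
    return kept
```

With this shape, the per-request cost is list slicing over cached IDs, and the tokenizer is only invoked when a turn is first loaded.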
Why is this needed:
To improve efficiency and readability.