Optimizing record-level operations in the SDK #867
aaronsteers started this conversation in General
Opening this discussion to explore options for record-level processing optimizations.
Known options
Other throttling factors
As a general rule, only pipelines that already have extremely high throughput will benefit at all from record-level Python optimizations.
Many taps and targets have their performance throttled by factors unrelated to Python record parsing.
In many cases, these factors dwarf the Python record processing time. Before assuming that Python-level optimizations will noticeably benefit a specific pipeline, evaluate them closely.
Challenges with record-level optimizations
If the tap developer needs to perform custom `post_process()` operations, these would necessarily be running in Python.
To overcome this challenge, mappings and transformations may need to be defined declaratively, for instance with something like `MyStream.ignored_properties = ['ignored_prop1', 'complex.ignored_subprop2']` to remove nodes and `MyStream.property_remappings = {'new.node.loc': 'old.node.loc'}` to declaratively remap properties. These transformations could then be performed at high speed using a low-level language or highly optimized library.
Of course, while many taps could benefit from this abstraction, some advanced postprocessing may be difficult or impossible to abstract in this way.
Alternative approaches
The best alternative to speeding up record-level operations is for the taps and targets to not have to process individual records at all. If the source system can generate batch files and the target system can read them, then each system operates under its ideal performance conditions and is not subject to per-record performance considerations on the Python side at all.
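As a rough illustration of that handoff, the sketch below has the tap emit one small message that points at files the source system already exported, and the target extract the file list to pass to its own bulk loader. The `BATCH_FILES` message type, its fields, and both function names are assumptions made up for this example and are not part of the Singer spec or the SDK.

```python
import json
from typing import List


def make_batch_message(
    stream: str, file_uris: List[str], file_format: str = "jsonl.gz"
) -> str:
    """Tap side: serialize a (hypothetical) manifest pointing at exported files."""
    return json.dumps(
        {
            "type": "BATCH_FILES",  # hypothetical message type
            "stream": stream,
            "format": file_format,
            "files": file_uris,
        }
    )


def files_from_message(line: str) -> List[str]:
    """Target side: return the files to bulk-load, or [] for other messages."""
    message = json.loads(line)
    if message.get("type") != "BATCH_FILES":
        return []
    return list(message["files"])


# One message per batch instead of one message per record.
line = make_batch_message("users", ["/tmp/users-0001.jsonl.gz"])
to_load = files_from_message(line)
```

With this shape, Python touches one message per batch, and the per-record work stays inside the source's export and the target's bulk-load facilities.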
See: