⚡️ Speed up function validate_gantt by 58x
#5386
base: main
Conversation
The optimization achieves a **58x speedup** by eliminating the major performance bottleneck in pandas DataFrame processing.

**Key optimizations:**

1. **Pre-fetch column data as numpy arrays**: The original code used `df.iloc[index][key]` for each cell access, which triggers pandas' slow row-based indexing mechanism. The optimized version extracts all column data upfront using `df[key].values` and stores it in a dictionary, then uses direct numpy array indexing `columns[key][index]` inside the loop.
2. **More efficient key validation**: Replaced the nested loop checking for missing keys with a single list comprehension `missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]`.
3. **Use actual DataFrame columns**: Instead of iterating over the DataFrame object itself (which includes metadata), the code now uses `list(df.columns)` to get only the actual column names.

**Why this is dramatically faster:**

- `df.iloc[index][key]` creates temporary pandas Series objects and involves complex indexing logic for each cell
- Direct numpy array indexing `columns[key][index]` is orders of magnitude faster
- The line profiler shows the original `df.iloc` line consumed 96.8% of execution time (523 ms), while the optimized dictionary comprehension takes only 44.9% (4.2 ms)

**Performance characteristics:**

- **Large DataFrames see massive gains**: 8000%+ speedup on 1000-row DataFrames
- **Small DataFrames**: 40-50% faster
- **List inputs**: Slight slowdown (3-13%) due to additional validation overhead, but still microsecond-level performance
- **Empty DataFrames**: Some slowdown due to upfront column extraction, but still fast overall

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated `iloc` calls created a severe performance bottleneck (see the sketch below).
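For concreteness, here is a minimal sketch of the two indexing patterns outside of plotly's actual code. The helper names (`rows_via_iloc`, `rows_via_prefetch`) are illustrative, and the `REQUIRED_GANTT_KEYS` values are assumed for this sketch; the real `_gantt.py` implementation differs in detail:

```python
import pandas as pd

# Mirrors plotly's required gantt columns (assumed values for this sketch).
REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]


def rows_via_iloc(df):
    # Original pattern: df.iloc[index][key] builds a temporary Series for
    # every row access, so each cell pays pandas' row-indexing overhead.
    return [
        {key: df.iloc[index][key] for key in df.columns}
        for index in range(len(df))
    ]


def rows_via_prefetch(df):
    # Optimized pattern: validate keys with one list comprehension, pull each
    # column out once as a NumPy array, then use plain integer indexing.
    missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]
    if missing_keys:
        raise ValueError(f"Missing required gantt keys: {missing_keys}")
    column_names = list(df.columns)  # actual column names only
    columns = {key: df[key].values for key in column_names}
    return [
        {key: columns[key][index] for key in column_names}
        for index in range(len(df))
    ]


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "Task": ["Job A", "Job B"],
            "Start": ["2009-01-01", "2009-03-05"],
            "Finish": ["2009-02-28", "2009-04-15"],
        }
    )
    # Both patterns produce the same list of row dicts; only the speed differs.
    assert rows_via_iloc(df) == rows_via_prefetch(df)
```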
Thanks for the PR! Could you please add test coverage or demonstrate that test coverage is already provided? Some tests failed CI, but I think that's unrelated to your changes.

@camdecoster Just added a test for it; fixing the formatting issue now.
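For context, a test along these lines would exercise the DataFrame path. The test name, sample data, and assertions here are illustrative and assume `validate_gantt` returns a list of row dicts, so the actual test added in the PR may differ:

```python
import pandas as pd

from plotly.figure_factory._gantt import validate_gantt


def test_validate_gantt_accepts_dataframe():
    df = pd.DataFrame(
        [
            dict(Task="Job A", Start="2009-01-01", Finish="2009-02-28"),
            dict(Task="Job B", Start="2009-03-05", Finish="2009-04-15"),
        ]
    )
    result = validate_gantt(df)

    # The validated chart should be a list of dicts, one per input row.
    assert isinstance(result, list)
    assert len(result) == 2
    assert all(isinstance(row, dict) for row in result)
    assert [row["Task"] for row in result] == ["Job A", "Job B"]
```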
camdecoster left a comment:
It looks like there could be some redundant tests in this test file. Could you please double-check and remove any redundant tests from your PR?
`assert all(isinstance(x, dict) for x in result)`

`@pytest.mark.skipif(pd is None, reason="pandas is not available")`
Could you please remove the skipif calls? Based on CI, Pandas will always be defined.
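Concretely, the guard comes off the new test. Taking the test sketched earlier, the requested change looks roughly like this (names and data remain illustrative):

```python
import pandas as pd  # always installed in plotly's CI, so no import guard is needed

from plotly.figure_factory._gantt import validate_gantt


# Before (as reviewed), the test carried:
#   @pytest.mark.skipif(pd is None, reason="pandas is not available")
# Dropping the decorator lets the test run unconditionally.
def test_validate_gantt_accepts_dataframe():
    df = pd.DataFrame([dict(Task="Job A", Start="2009-01-01", Finish="2009-02-28")])
    assert all(isinstance(row, dict) for row in validate_gantt(df))
```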
📄 5,759% (57.59x) speedup for `validate_gantt` in `plotly/figure_factory/_gantt.py`

⏱️ Runtime: 154 milliseconds → 2.63 milliseconds (best of 246 runs)
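For reference, a rough way to time the function locally. This is a sketch, not the benchmarking harness that produced the numbers above, and it assumes `validate_gantt` can be called directly with a DataFrame of Task/Start/Finish columns:

```python
import timeit

import pandas as pd

from plotly.figure_factory._gantt import validate_gantt

# A 1000-row frame, the case where the original per-cell iloc access was slowest.
n = 1000
df = pd.DataFrame(
    {
        "Task": [f"Job {i}" for i in range(n)],
        "Start": ["2009-01-01"] * n,
        "Finish": ["2009-02-28"] * n,
    }
)

# Best-of-several timing, similar in spirit to the "best of N runs" figure above.
best = min(timeit.repeat(lambda: validate_gantt(df), number=10, repeat=5))
print(f"best per-call time: {best / 10 * 1000:.2f} ms")
```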
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-validate_gantt-mhcxyu68` and push.