⚡️ Speed up function validate_gantt by 58x
#5386
base: main
Conversation
The optimization achieves a **58x speedup** by eliminating the major performance bottleneck in pandas DataFrame processing.

**Key optimizations:**

1. **Pre-fetch column data as numpy arrays**: The original code used `df.iloc[index][key]` for each cell access, which triggers pandas' slow row-based indexing mechanism. The optimized version extracts all column data upfront using `df[key].values` and stores it in a dictionary, then uses direct numpy array indexing `columns[key][index]` inside the loop.
2. **More efficient key validation**: Replaced the nested loop checking for missing keys with a single list comprehension `missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]`.
3. **Use actual DataFrame columns**: Instead of iterating over the DataFrame object itself (which includes metadata), the code now uses `list(df.columns)` to get only the actual column names.

**Why this is dramatically faster:**

- `df.iloc[index][key]` creates temporary pandas Series objects and involves complex indexing logic for each cell
- Direct numpy array indexing `columns[key][index]` is orders of magnitude faster
- The line profiler shows the original `df.iloc` line consumed 96.8% of execution time (523 ms), while the optimized dictionary comprehension takes only 44.9% (4.2 ms)

**Performance characteristics:**

- **Large DataFrames see massive gains**: 8000%+ speedup on 1000-row DataFrames
- **Small DataFrames**: 40-50% faster
- **List inputs**: Slight slowdown (3-13%) due to additional validation overhead, but still microsecond-level performance
- **Empty DataFrames**: Some slowdown due to upfront column extraction, but still fast overall

This optimization is most beneficial for DataFrame inputs with many rows, where the repeated `iloc` calls created a severe performance bottleneck (see the sketch below).
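For concreteness, here is a minimal sketch of the two indexing patterns outside of plotly's actual code. The helper names (`rows_via_iloc`, `rows_via_prefetch`) are illustrative, and the `REQUIRED_GANTT_KEYS` values are assumed for this sketch; the real `_gantt.py` implementation differs in detail:

```python
import pandas as pd

# Mirrors plotly's required gantt columns (assumed values for this sketch).
REQUIRED_GANTT_KEYS = ["Task", "Start", "Finish"]


def rows_via_iloc(df):
    # Original pattern: df.iloc[index][key] builds a temporary Series for
    # every row access, so each cell pays pandas' row-indexing overhead.
    return [
        {key: df.iloc[index][key] for key in df.columns}
        for index in range(len(df))
    ]


def rows_via_prefetch(df):
    # Optimized pattern: validate keys with one list comprehension, pull each
    # column out once as a NumPy array, then use plain integer indexing.
    missing_keys = [key for key in REQUIRED_GANTT_KEYS if key not in df]
    if missing_keys:
        raise ValueError(f"Missing required gantt keys: {missing_keys}")
    column_names = list(df.columns)  # actual column names only
    columns = {key: df[key].values for key in column_names}
    return [
        {key: columns[key][index] for key in column_names}
        for index in range(len(df))
    ]


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "Task": ["Job A", "Job B"],
            "Start": ["2009-01-01", "2009-03-05"],
            "Finish": ["2009-02-28", "2009-04-15"],
        }
    )
    # Both patterns produce the same list of row dicts; only the speed differs.
    assert rows_via_iloc(df) == rows_via_prefetch(df)
```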
Thanks for the PR! Could you please add test coverage or demonstrate that test coverage is already provided? Some tests failed CI, but I think that's unrelated to your changes.

@camdecoster Just added a test for it; fixing the formatting issue now.
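For context, a test along these lines would exercise the DataFrame path. The test name, sample data, and assertions here are illustrative and assume `validate_gantt` returns a list of row dicts, so the actual test added in the PR may differ:

```python
import pandas as pd

from plotly.figure_factory._gantt import validate_gantt


def test_validate_gantt_accepts_dataframe():
    df = pd.DataFrame(
        [
            dict(Task="Job A", Start="2009-01-01", Finish="2009-02-28"),
            dict(Task="Job B", Start="2009-03-05", Finish="2009-04-15"),
        ]
    )
    result = validate_gantt(df)

    # The validated chart should be a list of dicts, one per input row.
    assert isinstance(result, list)
    assert len(result) == 2
    assert all(isinstance(row, dict) for row in result)
    assert [row["Task"] for row in result] == ["Job A", "Job B"]
```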
camdecoster left a comment:
It looks like there could be some redundant tests in this test file. Could you please double-check and remove any redundant tests from your PR?
`assert all(isinstance(x, dict) for x in result)`

`@pytest.mark.skipif(pd is None, reason="pandas is not available")`
Could you please remove the skipif calls? Based on CI, Pandas will always be defined.
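Concretely, the guard comes off the new test. Taking the test sketched earlier, the requested change looks roughly like this (names and data remain illustrative):

```python
import pandas as pd  # always installed in plotly's CI, so no import guard is needed

from plotly.figure_factory._gantt import validate_gantt


# Before (as reviewed), the test carried:
#   @pytest.mark.skipif(pd is None, reason="pandas is not available")
# Dropping the decorator lets the test run unconditionally.
def test_validate_gantt_accepts_dataframe():
    df = pd.DataFrame([dict(Task="Job A", Start="2009-01-01", Finish="2009-02-28")])
    assert all(isinstance(row, dict) for row in validate_gantt(df))
```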
📄 5,759% (57.59x) speedup for `validate_gantt` in `plotly/figure_factory/_gantt.py`

⏱️ Runtime: 154 milliseconds → 2.63 milliseconds (best of 246 runs)
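For reference, a rough way to time the function locally. This is a sketch, not the benchmarking harness that produced the numbers above, and it assumes `validate_gantt` can be called directly with a DataFrame of Task/Start/Finish columns:

```python
import timeit

import pandas as pd

from plotly.figure_factory._gantt import validate_gantt

# A 1000-row frame, the case where the original per-cell iloc access was slowest.
n = 1000
df = pd.DataFrame(
    {
        "Task": [f"Job {i}" for i in range(n)],
        "Start": ["2009-01-01"] * n,
        "Finish": ["2009-02-28"] * n,
    }
)

# Best-of-several timing, similar in spirit to the "best of N runs" figure above.
best = min(timeit.repeat(lambda: validate_gantt(df), number=10, repeat=5))
print(f"best per-call time: {best / 10 * 1000:.2f} ms")
```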
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-validate_gantt-mhcxyu68` and push.