PERF: Avoid intermediate ndarray[object] array when constructing pyarrow-backed string data from untyped data #64429

@mroeschke

In [1]: import pandas as pd

In [2]: pd.Series(["a"]).array
> /pandas/core/construction.py(695)sanitize_array()
-> subarr = maybe_convert_platform(data)
(Pdb) data
['a']
(Pdb) n
> /pandas/core/construction.py(696)sanitize_array()
-> if subarr.dtype == object:
(Pdb) subarr
array(['a'], dtype=object)
(Pdb) c
Out[5]: 
<ArrowStringArray>
['a']
Length: 1, dtype: str

In [3]: pd.Series(["a"], dtype=pd.StringDtype("pyarrow"))
Out[3]: 
0    a
dtype: string

maybe_convert_platform calls construct_1d_object_array_from_listlike to convert the list to an ndarray[object] for further inference. I wonder if we can perform type inference without converting to object, so that for strings we can pass the list directly to ArrowStringArray._from_sequence_of_strings.
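A rough sketch of the idea, not a proposed patch: scan the list once to decide whether it holds only strings (and missing values), then hand the list straight to pyarrow with no ndarray[object] in between. The helper names `is_all_str_or_missing` and `build_arrow_string_array` are hypothetical, treating only `None` as missing is a simplifying assumption, and a real implementation would presumably do the scan in Cython rather than in a Python loop.

```python
def is_all_str_or_missing(values: list) -> bool:
    """Return True if `values` holds at least one str and otherwise only None.

    Hypothetical helper: it inspects the list in place, so no intermediate
    ndarray[object] is allocated for the inference step.
    """
    seen_str = False
    for value in values:
        if isinstance(value, str):
            seen_str = True
        elif value is not None:  # assumption: None is the only missing marker
            return False
    return seen_str


def build_arrow_string_array(values: list):
    """Build a pyarrow-backed string array directly from a Python list."""
    # Imported lazily so the inference helper above has no dependencies.
    import pandas as pd
    import pyarrow as pa

    if not is_all_str_or_missing(values):
        raise TypeError("expected only str or None values")
    # pyarrow consumes the list directly; None becomes a null entry.
    return pd.arrays.ArrowStringArray(pa.array(values, type=pa.string()))
```

With pandas and pyarrow installed, `build_arrow_string_array(["a", None])` should give an `ArrowStringArray` without going through `maybe_convert_platform`. Whether the extra scan costs less than the object-array conversion would need benchmarking.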

Labels: Arrow (pyarrow functionality), Performance (Memory or execution speed performance), Strings (String extension data type and string data)