Other issues contain detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API, or so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas.
Top 10 listed:
Seaborn
Perhaps the most interesting pandas usage. It's a hard dependency and is used a fair amount, for more than just data access. However, it all still seems fairly standard and common, so Seaborn may be a reasonable target to make work with multiple libraries. Uses a lot of isinstance checks (on pd.DataFrame, pd.Series).
- seaborn/_core.py: Series, to_numeric
- seaborn/matrix.py: DataFrame, isnull, .index.equals, .columns.equals
- seaborn/utils.py: DataFrame, Categorical, notnull
- seaborn/regression.py: only pd.notnull
- seaborn/distributions.py: .values, .copy, .iloc, .loc, .reset_index, .index, set_index, MultiIndex.from_arrays, Index, Series, concat, merge
- seaborn/relational.py: DataFrame, merge, .rename
- seaborn/categorical.py: DataFrame, iteritems, Series, notnull, option_context, isnull, groupby, get_group
- seaborn/_statistics.py: only Series
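The isinstance-based style noted for Seaborn tends to look roughly like the sketch below. This is not Seaborn's code: `to_vector` is a hypothetical helper written only to illustrate the pattern. Each `isinstance` branch names a concrete pandas class, so a dataframe from another library would skip those branches and end up in the generic `np.asarray` fallback, which may or may not do something sensible.

```python
import numpy as np
import pandas as pd


def to_vector(obj):
    """Normalize a pandas or array-like input to a 1-D NumPy array.

    Hypothetical helper for illustration; it is not taken from seaborn.
    """
    if isinstance(obj, pd.DataFrame):
        # A one-column frame is treated as a vector; anything wider is ambiguous.
        if obj.shape[1] != 1:
            raise ValueError("expected a single-column DataFrame")
        return obj.iloc[:, 0].to_numpy()
    if isinstance(obj, pd.Series):
        return obj.to_numpy()
    # Fallback for lists, tuples, and NumPy arrays.
    return np.asarray(obj).ravel()


frame = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
series = frame["x"]
from_frame = to_vector(frame)
from_series = to_vector(series)
from_list = to_vector([1.0, 2.0, 3.0])
```

All three calls return the same array here; only the first two branches depend on pandas types.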
Folium
Just a single non-test usage, in pd.py:
    def validate_location(location):  # noqa: C901
        ...
        if isinstance(location, np.ndarray) \
                or (pd is not None and isinstance(location, pd.DataFrame)):
            location = np.squeeze(location).tolist()


    def if_pandas_df_convert_to_numpy(obj):
        """Return a Numpy array from a Pandas dataframe.

        Iterating over a DataFrame has weird side effects, such as the first
        row being the column names. Converting to Numpy is more safe.
        """
        if pd is not None and isinstance(obj, pd.DataFrame):
            return obj.values
        else:
            return obj

PyJanitor
An interesting and unusual pattern that is common in this codebase: pd.DataFrame is extended through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):
    import pandas as pd
    import pandas_flavor as pf


    @pf.register_dataframe_method
    def join_fasta(
        df: pd.DataFrame, filename: str, id_col: str, column_name: str
    ) -> pd.DataFrame:
        """
        Convenience method to join in a FASTA file as a column.
        """
        ...
        return df

Statsmodels
A huge amount of usage, using a large API surface in a messy way - not easy to do anything with or draw conclusions from.
NetworkX
Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:
    def to_networkx_graph(data, create_using=None, multigraph_input=False):
        """Make a NetworkX graph from a known data structure."""
        # Pandas DataFrame
        try:
            import pandas as pd

            if isinstance(data, pd.DataFrame):
                if data.shape[0] == data.shape[1]:
                    try:
                        return nx.from_pandas_adjacency(data, create_using=create_using)
                    except Exception as err:
                        msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                        raise nx.NetworkXError(msg) from err
                else:
                    try:
                        return nx.from_pandas_edgelist(
                            data, edge_attr=True, create_using=create_using
                        )
                    except Exception as err:
                        msg = "Input is not a correct Pandas DataFrame edge-list."
                        raise nx.NetworkXError(msg) from err
        except ImportError:
            warnings.warn("pandas not found, skipping conversion test.", ImportWarning)


    def from_pandas_adjacency(df, create_using=None):
        try:
            df = df[df.index]
        except Exception as err:
            missing = list(set(df.index).difference(set(df.columns)))
            msg = f"{missing} not in columns"
            raise nx.NetworkXError("Columns must match Indices.", msg) from err

        A = df.values
        G = from_numpy_array(A, create_using=create_using)

        nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
        return G

And using the .drop method in group.py:
    def prominent_group(
        G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
    ):
        import pandas as pd
        ...
        betweenness = pd.DataFrame.from_dict(PB)
        if C is not None:
            for node in C:
                # remove from the betweenness all the nodes not part of the group
                betweenness.drop(index=node, inplace=True)
                betweenness.drop(columns=node, inplace=True)
        CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]

Perspective
A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.
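To give a feel for what "small but nontrivial" can mean, here is a minimal sketch that touches the same three areas: a MultiIndex, a categorical dtype, and datetime data. It is not Perspective's code. The function `describe_for_export`, the column names, and the coarse type labels are all hypothetical, chosen only for illustration.

```python
import pandas as pd


def describe_for_export(df):
    """Flatten a MultiIndex and report a coarse type label per column.

    Hypothetical helper for illustration; not taken from Perspective.
    """
    if isinstance(df.index, pd.MultiIndex):
        # Move the index levels back into ordinary columns.
        df = df.reset_index()
    kinds = {}
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, pd.CategoricalDtype):
            kinds[name] = "category"
        elif pd.api.types.is_datetime64_any_dtype(dtype):
            kinds[name] = "datetime"
        else:
            kinds[name] = "other"
    return df, kinds


frame = pd.DataFrame(
    {
        "group": ["a", "a", "b"],
        "item": [1, 2, 1],
        "kind": pd.Categorical(["u", "v", "u"]),
        "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    }
).set_index(["group", "item"])  # two index levels give a MultiIndex

flat, kinds = describe_for_export(frame)
```

Even this short helper needs index handling and dtype introspection, neither of which is plain column access.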
Scikit-learn
TODO: the usage of pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; it makes more sense to have a chat with the people doing the work there.
Matplotlib
Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, with no direct dependence on pandas. So it will work today with other dataframe libraries as well, as long as their columns can be converted to a NumPy array.
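The "dictionary of array-likes" idea can be sketched without Matplotlib at all. The helper below is hypothetical and is not Matplotlib's implementation. It assumes only that `data[name]` works and that the result can be passed to `np.asarray`, which is roughly the contract behind calls such as `ax.plot("x", "y", data=obj)`.

```python
import numpy as np
import pandas as pd


def get_columns(data, *names):
    """Look up each name in `data` and convert the result to a NumPy array.

    Hypothetical helper, not Matplotlib's implementation. It relies only on
    `data[name]` and on the result being convertible by `np.asarray`.
    """
    return [np.asarray(data[name]) for name in names]


plain_dict = {"x": [0, 1, 2], "y": [0.0, 1.0, 4.0]}
frame = pd.DataFrame(plain_dict)

# The same helper works for a plain dict and for a pandas DataFrame.
x_dict, y_dict = get_columns(plain_dict, "x", "y")
x_df, y_df = get_columns(frame, "x", "y")
```

Nothing in `get_columns` mentions pandas, so any container with string-keyed, array-convertible columns would work the same way.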
