
potentially relevant usage patterns / targets for a developer-focused API #71

@rgommers

Description

In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API, though, or so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas.

Top 10 listed:

(image: screenshot of the top packages depending on pandas, from libraries.io)

Seaborn

Perhaps the most interesting pandas usage. Pandas is a hard dependency, is used a fair amount, and for more than just data access; however, the usage still seems fairly standard and common, so Seaborn may be a reasonable target for making work with multiple dataframe libraries. It uses a lot of isinstance checks (on pd.DataFrame, pd.Series).
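
A minimal sketch of that pattern (a hypothetical helper, not seaborn's actual code), assuming the goal is to normalize input to a DataFrame:

import numpy as np
import pandas as pd


def normalize_input(data):
    # Hypothetical helper illustrating the isinstance-check pattern.
    if isinstance(data, pd.DataFrame):
        return data
    if isinstance(data, pd.Series):
        # promote a single column to a one-column DataFrame
        return data.to_frame()
    # anything else: fall back to treating it as a plain array-like
    return pd.DataFrame(np.asarray(data))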

Folium

Just a single non-test usage, in pd.py:

def validate_location(location):  # noqa: C901
    "...J
    if isinstance(location, np.ndarray) \
            or (pd is not None and isinstance(location, pd.DataFrame)):
        location = np.squeeze(location).tolist()


def if_pandas_df_convert_to_numpy(obj):
    """Return a Numpy array from a Pandas dataframe.
    Iterating over a DataFrame has weird side effects, such as the first
    row being the column names. Converting to Numpy is more safe.
    """
    if pd is not None and isinstance(obj, pd.DataFrame):
        return obj.values
    else:
        return obj

PyJanitor

An interesting/unusual pattern (common within pyjanitor), which extends pd.DataFrame through pandas_flavor with either accessors or methods. E.g., from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """
    Convenience method to join in a FASTA file as a column.
    """
    ...
    return df
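
For context, a simplified sketch of what such a registration decorator does (an assumed mechanism, not the actual pandas_flavor implementation): it attaches the function to pd.DataFrame so it becomes callable as a method.

import pandas as pd


def register_dataframe_method(method):
    # Simplified sketch: attach `method` to pd.DataFrame under its own name,
    # so callers can write df.method(...) instead of method(df, ...).
    def inner(self, *args, **kwargs):
        return method(self, *args, **kwargs)

    setattr(pd.DataFrame, method.__name__, inner)
    return method

With something like this in place, the decorated function above becomes callable as df.join_fasta(filename, id_col, column_name).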

Statsmodels

A huge amount of usage, covering a large API surface in a messy way; not easy to do anything with or draw conclusions from.

NetworkX

Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:

def to_networkx_graph(data, create_using=None, multigraph_input=False):
    """Make a NetworkX graph from a known data structure."""
    # Pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            if data.shape[0] == data.shape[1]:
                try:
                    return nx.from_pandas_adjacency(data, create_using=create_using)
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                    raise nx.NetworkXError(msg) from err
            else:
                try:
                    return nx.from_pandas_edgelist(
                        data, edge_attr=True, create_using=create_using
                    )
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame edge-list."
                    raise nx.NetworkXError(msg) from err
    except ImportError:
        warnings.warn("pandas not found, skipping conversion test.", ImportWarning)


def from_pandas_adjacency(df, create_using=None):
    try:
        df = df[df.index]
    except Exception as err:
        missing = list(set(df.index).difference(set(df.columns)))
        msg = f"{missing} not in columns"
        raise nx.NetworkXError("Columns must match Indices.", msg) from err

    A = df.values
    G = from_numpy_array(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G

And using the .drop method in group.py:

def prominent_group(
    G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
):
    import pandas as pd
    ...
    betweenness = pd.DataFrame.from_dict(PB)
    if C is not None:
        for node in C:
            # remove from the betweenness all the nodes not part of the group
            betweenness.drop(index=node, inplace=True)
            betweenness.drop(columns=node, inplace=True)
    CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]

Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.
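
An illustrative sketch of that kind of API surface (made-up data and column names, not Perspective's actual code):

import pandas as pd

# MultiIndex with a group level and a datetime level
idx = pd.MultiIndex.from_product(
    [["a", "b"], pd.date_range("2021-01-01", periods=4, freq="D")],
    names=["group", "ts"],
)
df = pd.DataFrame({"value": range(8)}, index=idx)
# categorical dtype for the group labels
df["group_cat"] = df.index.get_level_values("group").astype(
    pd.CategoricalDtype(categories=["a", "b"])
)
# pivot the group level to columns, then resample on the datetime index
daily = df["value"].unstack("group").resample("2D").mean()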

Scikit-learn

TODO: pandas usage in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. It therefore does not seem to make much sense to just look at the code; instead it makes sense to have a chat with the people doing that work.

Matplotlib

Added because it comes up a lot. Matplotlib just uses a "dictionary of array-likes" approach, with no direct dependence on pandas. So it already works today with other dataframe libraries as well, as long as the object can be indexed by column name and its columns can convert to a numpy array.
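
A minimal sketch of that approach (assumed column names and data):

import matplotlib.pyplot as plt

# Any mapping that supports data[name] and whose values convert to numpy
# arrays works here: a plain dict, a pandas DataFrame, or another
# dataframe library's object.
data = {"x": [0, 1, 2, 3], "y": [0, 1, 4, 9]}
fig, ax = plt.subplots()
ax.plot("x", "y", data=data)
ax.set_xlabel("x")
ax.set_ylabel("y")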
