
potentially relevant usage patterns / targets for a developer-focused API #71

@rgommers

Description

In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API, though, or so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas.

Top 10 listed:

(image: screenshot of the top packages depending on pandas, from libraries.io)

Seaborn

Perhaps the most interesting pandas usage. Pandas is a hard dependency, is used a fair amount, and for more than just data access; however, the usage still seems fairly standard and common, so Seaborn may be a reasonable target for making work with multiple dataframe libraries. It uses a lot of isinstance checks (on pd.DataFrame, pd.Series).
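
A minimal sketch of that pattern (a hypothetical helper, not seaborn's actual code), assuming the goal is to normalize input to a DataFrame:

import numpy as np
import pandas as pd


def normalize_input(data):
    # Hypothetical helper illustrating the isinstance-check pattern.
    if isinstance(data, pd.DataFrame):
        return data
    if isinstance(data, pd.Series):
        # promote a single column to a one-column DataFrame
        return data.to_frame()
    # anything else: fall back to treating it as a plain array-like
    return pd.DataFrame(np.asarray(data))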

Folium

Just a single non-test usage, in pd.py:

def validate_location(location):  # noqa: C901
    "...J
    if isinstance(location, np.ndarray) \
            or (pd is not None and isinstance(location, pd.DataFrame)):
        location = np.squeeze(location).tolist()


def if_pandas_df_convert_to_numpy(obj):
    """Return a Numpy array from a Pandas dataframe.
    Iterating over a DataFrame has weird side effects, such as the first
    row being the column names. Converting to Numpy is more safe.
    """
    if pd is not None and isinstance(obj, pd.DataFrame):
        return obj.values
    else:
        return obj

PyJanitor

An interesting/unusual pattern (common within pyjanitor), which extends pd.DataFrame through pandas_flavor with either accessors or methods. E.g., from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """
    Convenience method to join in a FASTA file as a column.
    """
    ...
    return df
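
For context, a simplified sketch of what such a registration decorator does (an assumed mechanism, not the actual pandas_flavor implementation): it attaches the function to pd.DataFrame so it becomes callable as a method.

import pandas as pd


def register_dataframe_method(method):
    # Simplified sketch: attach `method` to pd.DataFrame under its own name,
    # so callers can write df.method(...) instead of method(df, ...).
    def inner(self, *args, **kwargs):
        return method(self, *args, **kwargs)

    setattr(pd.DataFrame, method.__name__, inner)
    return method

With something like this in place, the decorated function above becomes callable as df.join_fasta(filename, id_col, column_name).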

Statsmodels

A huge amount of usage, covering a large API surface in a messy way; not easy to do anything with or draw conclusions from.

NetworkX

Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:

def to_networkx_graph(data, create_using=None, multigraph_input=False):
    """Make a NetworkX graph from a known data structure."""
    # Pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            if data.shape[0] == data.shape[1]:
                try:
                    return nx.from_pandas_adjacency(data, create_using=create_using)
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                    raise nx.NetworkXError(msg) from err
            else:
                try:
                    return nx.from_pandas_edgelist(
                        data, edge_attr=True, create_using=create_using
                    )
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame edge-list."
                    raise nx.NetworkXError(msg) from err
    except ImportError:
        warnings.warn("pandas not found, skipping conversion test.", ImportWarning)


def from_pandas_adjacency(df, create_using=None):
    try:
        df = df[df.index]
    except Exception as err:
        missing = list(set(df.index).difference(set(df.columns)))
        msg = f"{missing} not in columns"
        raise nx.NetworkXError("Columns must match Indices.", msg) from err

    A = df.values
    G = from_numpy_array(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G

And using the .drop method in group.py:

def prominent_group(
    G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
):
    import pandas as pd
    ...
    betweenness = pd.DataFrame.from_dict(PB)
    if C is not None:
        for node in C:
            # remove from the betweenness all the nodes not part of the group
            betweenness.drop(index=node, inplace=True)
            betweenness.drop(columns=node, inplace=True)
    CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]

Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.
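
An illustrative sketch of that kind of API surface (made-up data and column names, not Perspective's actual code):

import pandas as pd

# MultiIndex with a group level and a datetime level
idx = pd.MultiIndex.from_product(
    [["a", "b"], pd.date_range("2021-01-01", periods=4, freq="D")],
    names=["group", "ts"],
)
df = pd.DataFrame({"value": range(8)}, index=idx)
# categorical dtype for the group labels
df["group_cat"] = df.index.get_level_values("group").astype(
    pd.CategoricalDtype(categories=["a", "b"])
)
# pivot the group level to columns, then resample on the datetime index
daily = df["value"].unstack("group").resample("2D").mean()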

Scikit-learn

TODO: pandas usage in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. It therefore does not seem to make much sense to just look at the code; instead it makes sense to have a chat with the people doing that work.

Matplotlib

Added because it comes up a lot. Matplotlib just uses a "dictionary of array-likes" approach, with no direct dependence on pandas. So it already works today with other dataframe libraries as well, as long as the object can be indexed by column name and its columns can convert to a numpy array.
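
A minimal sketch of that approach (assumed column names and data):

import matplotlib.pyplot as plt

# Any mapping that supports data[name] and whose values convert to numpy
# arrays works here: a plain dict, a pandas DataFrame, or another
# dataframe library's object.
data = {"x": [0, 1, 2, 3], "y": [0, 1, 4, 9]}
fig, ax = plt.subplots()
ax.plot("x", "y", data=data)
ax.set_xlabel("x")
ax.set_ylabel("y")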
