Convert the get_indexes feature to use Ibis #92

Closed
4 tasks
mvanwyk opened this issue Feb 10, 2025 · 0 comments
Contributor

mvanwyk commented Feb 10, 2025

See here for a description of Index Plots

The get_indexes function calculates index values. It is typically used through the index plot rather than called directly. The function currently calculates indexes with Pandas, which is slow and requires the data to be loaded into memory. By using Ibis, we can push these calculations down to the database.

The index plot works by comparing a subgroup against the total group. For instance, if we broke customers into Heavy, Medium and Light segments, we might ask: what does the Light group buy more (or less) of than the other groups? The typical way to answer this is to compare the Light group's % of spend on a category (e.g. Music) with the % of spend for all customers. For instance, the Light group might put 10% of their spend into the Music category versus an average across all customers of 5%. To get the index, we take (10% / 5%) * 100, which gives an index of 200. Typically, an index >= 120 is considered significantly overindexed and an index <= 80 is considered significantly underindexed.
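The arithmetic above can be written out as a small sketch. The function names index_value and classify_index are hypothetical and used only for illustration; they are not part of pyretailscience. The 120 and 80 cut-offs are the rule of thumb quoted above.

```python
def index_value(subgroup_share: float, total_share: float) -> float:
    """Return the subgroup's share relative to the overall share, scaled to 100.

    Hypothetical helper for illustration only.
    """
    if total_share == 0:
        raise ValueError("total_share must be non-zero")
    return subgroup_share / total_share * 100


def classify_index(index: float) -> str:
    """Label an index using the rule-of-thumb thresholds from the text."""
    if index >= 120:
        return "overindexed"
    if index <= 80:
        return "underindexed"
    return "in line"


# Light customers put 10% of their spend into Music; all customers put in 5%.
light_music_index = index_value(0.10, 0.05)
light_music_label = classify_index(light_music_index)
```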

Presently, to identify the "Light" segment, you pass in the Pandas index locations of the rows where a customer has been segmented as "Light", expressed as a boolean filter (see df_index_filter=df["segment_name"] == "Light" in the code example below). This won't work with Ibis: Ibis works with database table-like objects, which have no concept of an index, so we will have to change it.

from pyretailscience.standard_graphs import index_plot

index_plot(
    df,
    df_index_filter=df["segment_name"] == "Light",
    value_col="unit_price",
    group_col="category_0_name",
)

My thinking is to split it into two parameters, index_col and value_to_index, as in the example below. Let me know if you think the naming is confusing.

from pyretailscience.standard_graphs import index_plot

index_plot(
    df,
    index_col="segment_name",
    value_to_index="Light",
    value_col="unit_price",
    group_col="category_0_name",
)
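To make the intended semantics of the two proposed parameters concrete, here is a minimal dependency-free sketch. It uses plain Python rows rather than Ibis, and the function get_indexes_sketch is hypothetical: it is not the library's implementation, and it assumes the index is the subgroup's share of value_col per group divided by the overall share per group, as described above.

```python
from collections import defaultdict


def get_indexes_sketch(rows, index_col, value_to_index, value_col, group_col):
    """Return {group: index} for the rows where index_col == value_to_index.

    Hypothetical illustration of the proposed parameters, not library code.
    """
    total_by_group = defaultdict(float)
    subgroup_by_group = defaultdict(float)
    for row in rows:
        total_by_group[row[group_col]] += row[value_col]
        # An equality test on a column replaces the old Pandas index filter.
        if row[index_col] == value_to_index:
            subgroup_by_group[row[group_col]] += row[value_col]

    total = sum(total_by_group.values())
    subgroup_total = sum(subgroup_by_group.values())
    if total == 0 or subgroup_total == 0:
        raise ValueError("no value to index against")

    indexes = {}
    for group, group_total in total_by_group.items():
        if group_total == 0:
            continue  # a group with no overall value has no defined index
        subgroup_share = subgroup_by_group.get(group, 0.0) / subgroup_total
        total_share = group_total / total
        indexes[group] = subgroup_share / total_share * 100
    return indexes


# Made-up data: Light puts 10% of spend into Music, versus 5% overall.
rows = [
    {"segment_name": "Light", "category_0_name": "Music", "unit_price": 10.0},
    {"segment_name": "Light", "category_0_name": "Books", "unit_price": 90.0},
    {"segment_name": "Heavy", "category_0_name": "Music", "unit_price": 40.0},
    {"segment_name": "Heavy", "category_0_name": "Books", "unit_price": 860.0},
]
indexes = get_indexes_sketch(
    rows,
    index_col="segment_name",
    value_to_index="Light",
    value_col="unit_price",
    group_col="category_0_name",
)
```

Because the subgroup is selected by a column name and a value instead of a row filter, the same logic can be expressed as a filter plus two grouped sums, which is the kind of query a database backend can run.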

I think the rest should be relatively straightforward.

Notes

  • The user should be able to pass in either a Pandas DataFrame or an Ibis table. If they pass in a DataFrame, convert it to an Ibis table via ibis.memtable(df).
  • If necessary, extend the unit tests to handle any edge cases that are not currently covered
  • Please update the index plots section of analysis_modules.md with the new version of the code
  • Update the index_plot function to make it compatible with the updated get_indexes function