
EDA: Report for Comparing Dataframes (create_diff_report) #702

Open
@jinglinpeng

Is your feature request related to a problem? Please describe.
Create a report that compares dataframes. The report is similar to sweetviz and to our create_report function.

Describe the solution you'd like
The API is similar to create_report and is as follows:

create_diff_report(
    dfs: Union[List[DataFrame], Dict[str, DataFrame]],
    config: Optional[Dict[str, Any]] = None,
    display: Optional[List[str]] = None,
    title: Optional[str] = "DataFrame Difference Report by DataPrep",
    mode: Optional[str] = "basic",
    progress: bool = True,
)

The dfs argument is a list of dataframes or a dict of dataframes. E.g., a user can call create_diff_report([df1, df2]) or create_diff_report({'train': df1, 'test': df2}). In the former case the dataframes are named 'df1', 'df2'. In the latter case each key is the name of its dataframe.
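
A minimal sketch of the naming rule above, assuming a hypothetical helper called normalize_dfs (it is not part of DataPrep). It only illustrates how both input forms could be reduced to a single name-to-dataframe mapping before the report is built. It treats the dataframes as opaque objects, so it runs without pandas.

```python
from typing import Any, Dict, List, Union


def normalize_dfs(dfs: Union[List[Any], Dict[str, Any]]) -> Dict[str, Any]:
    """Return a name -> dataframe mapping for either accepted input form.

    Hypothetical helper for illustration only; not an existing DataPrep API.
    """
    if isinstance(dfs, dict):
        # Keys are the user-chosen dataframe names; insertion order is kept.
        return dict(dfs)
    if isinstance(dfs, (list, tuple)):
        # Positional inputs get the default names 'df1', 'df2', ...
        return {f"df{i}": df for i, df in enumerate(dfs, start=1)}
    raise TypeError("dfs must be a list or a dict of dataframes")
```

With this shape, the rest of the report code would only ever deal with the dict form, whichever way the user called the function.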

The layout of the report is similar to that of create_report. It has the following sections:

1. Overview. The overview section is like the overview in create_report. The content is from plot_diff([df1, df2]), as shown in the following figure.
image

2. Variables
The layout is similar to the Variables section in create_report:
image
The differences are:

  1. For the content, the single-dataframe statistics become multi-dataframe statistics. The layout follows what we did in plot_diff([df1, df2], x):
    image
  2. For the figure, we show a distribution comparison instead, e.g., a histogram comparison for a numerical column and a bar chart comparison for a categorical column. The following figures show the histogram comparison and the bar chart comparison:
    image
  3. Under the show details button, each tab changes to its multi-dataframe version.
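
As a rough sketch of the data behind the comparison figures in item 2, the snippet below bins a numerical column on edges shared by all dataframes and aligns category counts for a categorical column. The function names shared_histograms and category_counts are hypothetical and the code is pure Python for illustration; it is not how plot_diff computes its figures, and it assumes every column is non-empty.

```python
from collections import Counter
from typing import Dict, List, Sequence


def shared_histograms(
    columns: Dict[str, Sequence[float]], bins: int = 10
) -> Dict[str, List[int]]:
    """Bin one numerical column from each dataframe on common bin edges."""
    all_values = [v for values in columns.values() for v in values]
    lo, hi = min(all_values), max(all_values)
    # Fall back to width 1.0 when every value is identical (zero range).
    width = (hi - lo) / bins or 1.0
    counts: Dict[str, List[int]] = {}
    for name, values in columns.items():
        hist = [0] * bins
        for v in values:
            # Clamp so the overall maximum lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            hist[idx] += 1
        counts[name] = hist
    return counts


def category_counts(
    columns: Dict[str, Sequence[str]]
) -> Dict[str, Dict[str, int]]:
    """Count categories per dataframe over the union of all categories."""
    counters = {name: Counter(values) for name, values in columns.items()}
    categories = sorted(set().union(*counters.values()))
    # Missing categories show up with a count of 0, so bars line up.
    return {
        name: {c: counter[c] for c in categories}
        for name, counter in counters.items()
    }
```

Sharing the bin edges and the category set across dataframes is the design choice that makes the histograms and bars directly comparable side by side.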

3. ...To be continued
