Open
Labels
question ❓ Questions about Modin
Description
import logging

import modin.pandas as pd

logger = logging.getLogger(__name__)


def log_partitions(input_df):
    # Note: this reaches into Modin's private internals to get the partition matrix.
    partitions = input_df._query_compiler._modin_frame._partitions
    # Iterate through the partition matrix
    logger.info(f"Row partitions: {len(partitions)}")
    row_index = 0
    for partition_row in partitions:
        print(f"Row {row_index} has Column partitions {len(partition_row)}")
        col_index = 0
        for partition in partition_row:
            print(f"DF Shape {partition.get().shape} is for row {row_index} column {col_index}")
            col_index = col_index + 1
        row_index = row_index + 1


df = pd.DataFrame({"col": ["A,B,C", "X,Y,Z", "1,2,3"]})
log_partitions(df)

for i in range(3):  # Adding columns one by one
    df[f"split_{i}"] = df["col"].str.split(",").str[i]

print(df)
log_partitions(df)
This gives the following output:
Row 0 has Column partitions 1
DF Shape (3, 1) is for row 0 column 0
     col split_0 split_1 split_2
0  A,B,C       A       B       C
1  X,Y,Z       X       Y       Z
2  1,2,3       1       2       3
Row 0 has Column partitions 4
DF Shape (3, 1) is for row 0 column 0
DF Shape (3, 1) is for row 0 column 1
DF Shape (3, 1) is for row 0 column 2
DF Shape (3, 1) is for row 0 column 3
Modin creates a new column partition for each column that is added. The code above is a minimal sample that reproduces the behavior. The real problem shows up in a pipeline step: after many partitions have been created this way, if the next step operates on multiple columns that belong to different partitions, performance is very bad. What is the solution for this?
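For comparison, here is a minimal sketch of a variant that builds all the split columns in one operation instead of assigning them one by one. It uses `str.split(..., expand=True)` and a single `pd.concat`, both of which are part of the pandas API that Modin mirrors. The fallback to plain pandas is only an assumption added so the snippet runs where Modin is not installed. Whether this actually leaves fewer column partitions is not something I can confirm; it likely depends on the Modin version and engine, so it would need to be checked with `log_partitions`.

```python
# Assumption: fall back to plain pandas so the sketch runs without Modin installed.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

df = pd.DataFrame({"col": ["A,B,C", "X,Y,Z", "1,2,3"]})

# Split once; expand=True returns a DataFrame with one column per piece.
parts = df["col"].str.split(",", expand=True)
parts.columns = [f"split_{i}" for i in range(parts.shape[1])]

# Attach all new columns in a single concat instead of one assignment per column.
df = pd.concat([df, parts], axis=1)
print(df)
```

Calling `log_partitions(df)` after this should show whether the partition layout differs from the column-by-column version.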
Thanks in advance
Sumukhagc