
Dask engine can only have 4 partitions - Pull Requests #2492 #2498

Closed
@zhfanrui

Description


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS release 6.5 (Final)
  • Modin version (modin.__version__): 0.8.2 installed from conda-forge channel
  • Python version: 3.7.3
  • Code we can use to reproduce:
from distributed import Client
client = Client(n_workers=10)
import modin.pandas as pd
df = pd.read_csv(filename)

Describe the problem

Hi there, I recently got a really huge CSV file to read and found that no matter how I set the number of workers using

from distributed import Client
client = Client(n_workers=10)
import modin.pandas as pd

the DataFrame object still ends up with only 4 partitions, which makes performance really poor.
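A quick check confirms this — the module-level default is not derived from the Dask client at all (minimal example, same setup as above):

from distributed import Client
client = Client(n_workers=10)

import modin.pandas as pd
print(pd.DEFAULT_NPARTITIONS)  # still prints 4, not 10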

So I briefly reviewed some of the code (it actually took me a whole afternoon to track down the problem) and found that pd.DEFAULT_NPARTITIONS defaults to 4 (in modin/modin/pandas/__init__.py) and that the init function (_update_engine) does not change that value when the Dask engine is used.
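For reference, the kind of change I have in mind for _update_engine is roughly the following (just a sketch of the idea — the helper name is made up, how exactly to query the worker count from the client is my assumption, and the actual line in #2492 may differ):

# sketch: inside modin/pandas/__init__.py, Dask branch of _update_engine
def _set_default_npartitions_from_dask():
    global DEFAULT_NPARTITIONS
    from distributed.client import default_client
    client = default_client()                   # the already-initialized Dask client
    num_cpus = sum(client.ncores().values())    # total worker threads in the cluster
    DEFAULT_NPARTITIONS = max(4, int(num_cpus))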

Hence, I changed pd.DEFAULT_NPARTITIONS right after importing modin.pandas with

pd.DEFAULT_NPARTITIONS = 10

and it works well!
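In case it helps anyone else hitting this, the complete workaround I'm using for now looks like this (hard-coding 10 also works; deriving the value from the client is just a bit more general):

from distributed import Client

client = Client(n_workers=10)

import modin.pandas as pd

# work around _update_engine not setting this for the Dask engine
pd.DEFAULT_NPARTITIONS = len(client.ncores())  # one entry per worker -> 10

df = pd.read_csv(filename)  # the DataFrame should now be split into 10 partitions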

So I decided to add a line of code to solve this problem (see pull request #2492). However, it failed in test_read_csv_datetime, and I'm really confused about that.

Source code / logs

Labels: bug 🦗 Something isn't working
