Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS release 6.5 (Final)
- Modin version (`modin.__version__`): 0.8.2, installed from the conda-forge channel
- Python version: 3.7.3
- Code we can use to reproduce:
```python
from distributed import Client
client = Client(n_workers=10)
import modin.pandas as pd
df = pd.read_csv(filename)
```
Describe the problem
Hi there, I recently got a really huge CSV file to read and found that no matter how I set the number of workers with

```python
from distributed import Client
client = Client(n_workers=10)
import modin.pandas as pd
```

the DataFrame is still split into only 4 partitions, which makes reading it really inefficient.
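To illustrate why this hurts: with only 4 row partitions there can never be more than 4 parallel tasks, so 6 of the 10 Dask workers sit idle. A minimal sketch of the row-splitting arithmetic (illustrative only, not Modin's actual partitioning code):

```python
def split_row_counts(n_rows, n_partitions):
    """Split n_rows into n_partitions near-equal chunks (illustrative)."""
    base, extra = divmod(n_rows, n_partitions)
    return [base + (1 if i < extra else 0) for i in range(n_partitions)]

# With the default of 4 partitions, only 4 tasks exist for 10 workers.
print(split_row_counts(1_000_000, 4))   # → [250000, 250000, 250000, 250000]
print(split_row_counts(1_000_000, 10))  # → ten chunks of 100000 rows each
```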
So I briefly reviewed the code (actually, it took me a whole afternoon to track the problem down) and found that `pd.DEFAULT_NPARTITIONS` defaults to 4 (in `modin/modin/pandas/__init__.py`) and that the init function (`_update_engine`) does not change this value when the Dask engine is used. Hence, I set `pd.DEFAULT_NPARTITIONS` after importing `modin.pandas`:

```python
pd.DEFAULT_NPARTITIONS = 10
```

and it works well!
So I decided to add a line of code to solve this problem (see pull request #2492). However, it failed in `test_read_csv_datetime`. I'm really confused about that.