PERF: nancorr_spearman #41857

Merged
merged 5 commits into from
Jun 9, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -858,6 +858,7 @@ Performance improvements
- Performance improvement in :meth:`Series.isin` for nullable data types (:issue:`38340`)
- Performance improvement in :meth:`DataFrame.fillna` with ``method="pad|backfill"`` for nullable floating and nullable integer dtypes (:issue:`39953`)
- Performance improvement in :meth:`DataFrame.corr` for ``method=kendall`` (:issue:`28329`)
- Performance improvement in :meth:`DataFrame.corr` for ``method=spearman`` (:issue:`40956`)
- Performance improvement in :meth:`.Rolling.corr` and :meth:`.Rolling.cov` (:issue:`39388`)
- Performance improvement in :meth:`.RollingGroupby.corr`, :meth:`.ExpandingGroupby.corr`, :meth:`.ExpandingGroupby.corr` and :meth:`.ExpandingGroupby.cov` (:issue:`39591`)
- Performance improvement in :func:`unique` for object data type (:issue:`37615`)
95 changes: 50 additions & 45 deletions pandas/_libs/algos.pyx
@@ -383,8 +383,8 @@ def nancorr_spearman(ndarray[float64_t, ndim=2] mat, Py_ssize_t minp=1) -> ndarr
Py_ssize_t i, j, xi, yi, N, K
ndarray[float64_t, ndim=2] result
ndarray[float64_t, ndim=2] ranked_mat
ndarray[float64_t, ndim=1] maskedx
ndarray[float64_t, ndim=1] maskedy
ndarray[float64_t, ndim=1] rankedx, rankedy
float64_t[::1] maskedx, maskedy
ndarray[uint8_t, ndim=2] mask
int64_t nobs = 0
float64_t vx, vy, sumx, sumxx, sumyy, mean, divisor
@@ -399,56 +399,61 @@ def nancorr_spearman(ndarray[float64_t, ndim=2] mat, Py_ssize_t minp=1) -> ndarr

ranked_mat = np.empty((N, K), dtype=np.float64)

# Note: we index into maskedx, maskedy in loops up to nobs, but using N is safe
# here since N >= nobs and values are stored contiguously
maskedx = np.empty(N, dtype=np.float64)
maskedy = np.empty(N, dtype=np.float64)
for i in range(K):
ranked_mat[:, i] = rank_1d(mat[:, i], labels=labels_n)

for xi in range(K):
for yi in range(xi + 1):
nobs = 0
# Keep track of whether we need to recompute ranks
all_ranks = True
for i in range(N):
all_ranks &= not (mask[i, xi] ^ mask[i, yi])
if mask[i, xi] and mask[i, yi]:
nobs += 1

if nobs < minp:
result[xi, yi] = result[yi, xi] = NaN
else:
maskedx = np.empty(nobs, dtype=np.float64)
maskedy = np.empty(nobs, dtype=np.float64)
j = 0

with nogil:
for xi in range(K):
for yi in range(xi + 1):
nobs = 0
# Keep track of whether we need to recompute ranks
all_ranks = True
for i in range(N):
all_ranks &= not (mask[i, xi] ^ mask[i, yi])
if mask[i, xi] and mask[i, yi]:
maskedx[j] = ranked_mat[i, xi]
maskedy[j] = ranked_mat[i, yi]
j += 1

if not all_ranks:
labels_nobs = np.zeros(nobs, dtype=np.int64)
maskedx = rank_1d(maskedx, labels=labels_nobs)
maskedy = rank_1d(maskedy, labels=labels_nobs)

mean = (nobs + 1) / 2.

# now the cov numerator
sumx = sumxx = sumyy = 0

for i in range(nobs):
vx = maskedx[i] - mean
vy = maskedy[i] - mean

sumx += vx * vy
sumxx += vx * vx
sumyy += vy * vy

divisor = sqrt(sumxx * sumyy)
maskedx[nobs] = ranked_mat[i, xi]
maskedy[nobs] = ranked_mat[i, yi]
nobs += 1

if divisor != 0:
result[xi, yi] = result[yi, xi] = sumx / divisor
else:
if nobs < minp:
result[xi, yi] = result[yi, xi] = NaN
else:
if not all_ranks:
with gil:
Contributor

Why the GIL here?

Member Author

rank_1d can't be called with nogil. Perhaps some refactoring could allow calling a nogil rank_1d helper instead, but that would be a larger change.

Contributor

I see, OK. I think it's worthwhile to make that nogil (but not in this PR); a follow-on is preferred.

# We need to slice back to nobs because rank_1d will
# require arrays of nobs length
labels_nobs = np.zeros(nobs, dtype=np.int64)
rankedx = rank_1d(np.array(maskedx)[:nobs],
Contributor

Yeah, this should really take a memoryview (or have a helper function to do it).

labels=labels_nobs)
rankedy = rank_1d(np.array(maskedy)[:nobs],
labels=labels_nobs)
for i in range(nobs):
maskedx[i] = rankedx[i]
maskedy[i] = rankedy[i]

mean = (nobs + 1) / 2.

# now the cov numerator
sumx = sumxx = sumyy = 0

for i in range(nobs):
vx = maskedx[i] - mean
vy = maskedy[i] - mean

sumx += vx * vy
sumxx += vx * vx
sumyy += vy * vy

divisor = sqrt(sumxx * sumyy)

if divisor != 0:
result[xi, yi] = result[yi, xi] = sumx / divisor
else:
result[xi, yi] = result[yi, xi] = NaN

return result
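
For readers who want to follow the loop without the Cython typing, here is a minimal pure-Python sketch of the same pairwise-complete Spearman computation for a single pair of columns. It is an illustration under stated assumptions, not the pandas implementation: `average_ranks`, `rank_column` and `spearman_pair` are hypothetical helper names, average ranking of ties stands in for `rank_1d`, and NaN is the only missing-value marker considered.

```python
import math


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = tied_rank
        start = end + 1
    return ranks


def rank_column(col):
    """Rank the non-NaN entries of a column; NaN entries stay NaN."""
    present = [i for i, v in enumerate(col) if not math.isnan(v)]
    ranks = [math.nan] * len(col)
    for i, r in zip(present, average_ranks([col[i] for i in present])):
        ranks[i] = r
    return ranks


def spearman_pair(x, y, minp=1):
    """Hypothetical helper: Spearman correlation over rows present in both columns."""
    n = len(x)
    ranked_x, ranked_y = rank_column(x), rank_column(y)

    # Buffers are allocated once at length n and filled up to nobs,
    # so nobs doubles as the write index.
    maskedx = [0.0] * n
    maskedy = [0.0] * n
    nobs = 0
    all_ranks = True  # stays True only if both columns are missing on the same rows
    for i in range(n):
        x_ok = not math.isnan(x[i])
        y_ok = not math.isnan(y[i])
        all_ranks = all_ranks and (x_ok == y_ok)
        if x_ok and y_ok:
            maskedx[nobs] = ranked_x[i]
            maskedy[nobs] = ranked_y[i]
            nobs += 1

    if nobs < minp:
        return math.nan

    if not all_ranks:
        # The precomputed ranks no longer run over 1..nobs, so re-rank the
        # first nobs entries only.
        maskedx[:nobs] = average_ranks(maskedx[:nobs])
        maskedy[:nobs] = average_ranks(maskedy[:nobs])

    mean = (nobs + 1) / 2.0  # mean of the ranks 1..nobs
    sumx = sumxx = sumyy = 0.0
    for i in range(nobs):
        vx = maskedx[i] - mean
        vy = maskedy[i] - mean
        sumx += vx * vy
        sumxx += vx * vx
        sumyy += vy * vy

    divisor = math.sqrt(sumxx * sumyy)
    return sumx / divisor if divisor != 0 else math.nan


nan = float("nan")
print(spearman_pair([1.0, 2.0, nan, 4.0, 5.0], [2.0, 1.0, 3.0, nan, 5.0]))  # 0.5
```

Filling the buffers and counting `nobs` in the same pass is what lets the changed code drop the separate counting loop and the per-pair `np.empty` allocations. The re-ranking step is the only part that, in the Cython version, still needs the GIL.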
